Overview
370 Architecture - Huge Picture

Required elements
1. CPU
2. Storage
3. I/O subsystem, including devices
4. Console

Optional elements
1. additional CPUs for multi-processing
2. Vector elements
3. Expanded storage
SONA Overview - SS System

[Block diagram. CPU: I-unit, E-unit, S-unit. System Storage: System Controller, System Data Switch, Expanded Store, Main Store. I/O Subsystem, attached to Customer Devices. Service Processor, attached to its own devices. Three paths are labeled Scan.]
SONA Overview - SS System

- **CPU**
  - Up to 4 on a side (QP).
  - Each CPU includes I, E, and S units.
  - CPU fits on an MLG.

- **System storage**
  - Focal point for data traffic.
  - Includes Main Store and Expanded Store.
  - CPUs talk to System Storage, not to each other.
  - SC and SDS are each an MLG.
  - Main Store and Expanded Store are implemented in ET technology on BLCs.

- **I/O Subsystem**
  - Gateway to the real world.
  - Linked to System Storage.
  - 1 MLG per IOP.
  - QDIH's (BLCs, ET) provide actual interfaces to devices.

- **Service Processor**
  - Stand-alone computer system.
  - Has its own devices, including hard disks and terminals.
  - Communicates with mainframe via scan.
  - Implemented in ET technology on BLCs.
1. Program Status Word (PSW) is the starting point
   • 64 bit register containing various parameters for instruction execution.
     - Current Instruction Address.
     - Dynamic Address Translation (DAT) mode: enables DAT (virtual addressing).
     - Condition Code (CC): result of prior operation. Used in conditional branch.
     - Key: 4 bits compared against storage key to verify that a storage access is OK.
     - Rupt/exception masks: enable/disable various interrupts.

2. Translate the instruction address
   • if DAT is on, the address needs to be translated to an Absolute Address.

3. Fetch the instruction
   • Includes an opcode and various fields (described later).

4. For storage accesses:
   • Generate the Operand Effective Address. Fields in the instruction point to GPRs, which are used in Effective Address Generation (EAG).
   • If DAT is enabled, translate the address to an Absolute Address.
   • The Absolute Address is used to fetch the operand from storage.

5. For register operands:
   • registers include:
     - General Purpose Registers: most common registers, used for most operations.
     - Floating Point Registers: used in floating point operations.
     - PSW: see above.
     - Control Registers: contain a variety of parameters, many used in DAT.
   • a field in the instruction points to the register.

6. Send the operands to the ALU for processing.

7. Store the results away in either storage or a register.
<table>
<thead>
<tr>
<th>GPR #</th>
<th>4 bytes (1 word)</th>
<th>Even/odd pair</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td></td>
<td></td>
</tr>
<tr>
<td>1</td>
<td></td>
<td></td>
</tr>
<tr>
<td>2</td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td></td>
<td></td>
</tr>
<tr>
<td>4</td>
<td></td>
<td></td>
</tr>
<tr>
<td>5</td>
<td></td>
<td></td>
</tr>
<tr>
<td>6</td>
<td></td>
<td></td>
</tr>
<tr>
<td>7</td>
<td></td>
<td></td>
</tr>
<tr>
<td>8</td>
<td></td>
<td></td>
</tr>
<tr>
<td>9</td>
<td></td>
<td></td>
</tr>
<tr>
<td>A</td>
<td></td>
<td></td>
</tr>
<tr>
<td>B</td>
<td></td>
<td></td>
</tr>
<tr>
<td>C</td>
<td></td>
<td></td>
</tr>
<tr>
<td>D</td>
<td></td>
<td></td>
</tr>
<tr>
<td>E</td>
<td></td>
<td></td>
</tr>
<tr>
<td>F</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
General Purpose Registers

• **General Purpose Registers.**
  - Also called just General Registers in POO.

• **Focal point for a lot of processing.**
  - Contain general data results.
  - Also can contain addresses (see EAG).

• **16 of them.**
  - Four bit field indicates which register to use.

• **Four bytes wide = 1 word.**

• **Can be addressed in even/odd pairs to do double-word operations.**
  - Register field points to even register. Second (odd) register is implied by opcode.
Effective Address Generation

GPRs

Index '2' -> 00001234

Base '9' -> 56789ABC

Displacement 'DEF'

00001234 + 56789ABC + DEF = 5678BADF

Operand Effective Address
Effective Address Generation

- **Uses two GPRs, Index and Base.**
  - Index GPR pointed to by X field of the instruction.
  - Base GPR pointed to by B field of the instruction.

- **Adds them in with 12 bit (right aligned) Displacement.**
  - Displacement is also a field of the instruction.
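
The addition above can be sketched in a few lines of Python. This is only an illustration, not the hardware algorithm: the function name, the `addr_mask` parameter (31-bit wrap), and the treatment of a zero register field as "no register" are assumptions, the last one following the usual System/370 convention. The register values are the ones from the figure.

```python
def effective_address(gprs, x, b, d, addr_mask=0x7FFFFFFF):
    """Add the index GPR, the base GPR, and the 12-bit displacement."""
    if not 0 <= d <= 0xFFF:
        raise ValueError("displacement must fit in 12 bits")
    # Assumption (usual System/370 convention): a register field of 0
    # means "no register", so it contributes zero rather than GPR 0.
    index = gprs[x] if x != 0 else 0
    base = gprs[b] if b != 0 else 0
    # Assumption: the sum wraps at the addressing-mode width (31 bits here).
    return (index + base + d) & addr_mask


gprs = [0] * 16
gprs[0x2] = 0x00001234   # index register from the figure
gprs[0x9] = 0x56789ABC   # base register from the figure
print(f"{effective_address(gprs, x=0x2, b=0x9, d=0xDEF):08X}")  # 5678BADF
```

With `x=0` the index drops out and the result is simply base plus displacement.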
Sample Instruction Formats

• **RR**
  - Operates on two registers.
  - Typically, both registers are inputs to the operation and the result is stored in R1.

• **RX**
  - Operates on a register (R1) and a storage location (X2, B2, D2).
  - Typically, both operands are inputs to the operation and the result is stored in R1.
  - X2, B2, and D2 are inputs to EAG to generate the address of the first byte of data.

• **SS**
  - Storage to storage.
  - Typically, two fields of storage data form the operands.
  - Results are usually stored to the first operand.
  - The L fields indicate the length of each operand.
  - EAG (without an index) points to the first byte in each field.

• **SI**
  - Storage-Immediate.
  - One operand is storage, the other is a field in the instruction itself.

• **xxE (e.g. RRE)**
  - Extended (2 byte) opcode.
  - First byte points to a "family" of related opcodes.

• **Note:**
  - Instructions are 1, 2, or 3 half-words. Must be half-word aligned.
  - Register addresses are nibbles.
  - Displacement is 12 bits.
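
To make the field layout concrete, here is a small decoding sketch. It assumes the standard System/370 conventions as I understand them: the two high-order bits of the first opcode byte give the instruction length, and an RX instruction is laid out as opcode (8 bits), R1, X2, B2 (4 bits each), D2 (12 bits). The function names are made up for illustration.

```python
def instruction_length(opcode):
    """Length in bytes, from the two high-order bits of the first opcode byte."""
    top = opcode >> 6
    if top == 0b00:
        return 2      # 1 half-word (e.g. RR)
    if top == 0b11:
        return 6      # 3 half-words (e.g. SS)
    return 4          # 2 half-words (e.g. RX, SI)


def decode_rx(instr):
    """Split a 4-byte RX instruction (given as an integer) into its fields."""
    return {
        "opcode": (instr >> 24) & 0xFF,
        "r1": (instr >> 20) & 0xF,   # register addresses are nibbles
        "x2": (instr >> 16) & 0xF,
        "b2": (instr >> 12) & 0xF,
        "d2": instr & 0xFFF,         # displacement is 12 bits
    }


fields = decode_rx(0x5A329DEF)
print({name: f"{value:X}" for name, value in fields.items()})
```

The X2, B2, and D2 values that come out of `decode_rx` are exactly the inputs EAG needs.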
Exercise

PSW

\[
\begin{array}{cccccccc}
0 & 0 & 0 & 9 & 0 & 0 & 0 & 0 \\
\end{array}
\]

GPR 1

\[
\begin{array}{cccc}
0 & 0 & 0 & 10 \\
\end{array}
\]

GPR 2

\[
\begin{array}{cccc}
0 & 0 & 0 & 0 \\
\end{array}
\]

GPR 3

\[
\begin{array}{cccc}
FF & 0 & F & FF \\
\end{array}
\]

GPR 4

\[
\begin{array}{cccc}
0 & 0 & 0 & 0 \\
\end{array}
\]

GPR 6

\[
\begin{array}{cccc}
0 & 0 & 0 & 0 \\
\end{array}
\]

Storage

\[
\begin{array}{cccc}
0FF0 & 49 & 32 & 00 \\
0FF4 & 37 & FC & A0 \\
0FF8 & 54 & 9C & FF \\
0FFC & 47 & 32 & FF \\
1000 & 54 & 32 & 10 \\
1004 & 50 & 32 & 10 \\
1008 & 47 & F2 & 10 \\
100C & 1A & 44 & 1B \\
1010 & 47 & 32 & 10 \\
1014 & 50 & 42 & 20 \\
1018 & 47 & F2 & 10 \\
101C & FC & FF & FC \\
1020 & 21 & 32 & 48 \\
1024 & ED & B9 & 45 \\
\end{array}
\]
Architecture Summary

Key points of architecture.

- PSW has Instruction Address, this contains code.
- 16 bytes for 6 ops.
- EA bits (1011, 1010, 1001) for op codes.
- Z bits length: instruction w 1-2 byte operand.
- 2 reference 16, 32, 64, 96, 128.
- 32 reference 64, 96, 128.
SONA Pipeline Overview

SONA Pipeline

D   Decode instruction to generate controls. Do Effective Address Generation.
A   Address sent to the S-unit.
T   TAG/TLB access.
B   Buffer (cache) access.
X   Execute the instruction.
W   Write the results back.

Notes

• This pipeline is interlocked.
• Two cycles are not shown - they're I-unit constructs and are generally transparent to the rest of the machine. They're not part of the "official" pipe (at least in my view).
  C   Control store access on second and subsequent flows. In front of D.
  Z   When data is really written to the GPRs and other registers. After W.

Key Registers

Instruction Data Register - holds 4 bytes of instruction.
Operand Word Register - contains 8 bytes (doubleword) of operand data.
Result Register - contains 8 bytes of operation results.

S-unit Pipeline

P   Initial priority cycle.
A   Address selection (based on final priority)
T   TAG/TLB access.
B   Buffer (cache) access.
R   Data (results) clocked into OWR.

Note

• The S-unit pipeline is free-running.
Branches

Branch Processing
- Goes through EAG like any other RX instruction. (D, A cycles)
- Address sent to TAGs/TLB and Buffer. (T, B cycles)
  * Note: non-branch instructions will access OP Tags and buffers, whereas a branch accesses IF Tags and Buffer. The timing of these two accesses is the same.
- New instruction data is loaded. (X cycle)
  * Note: non-branch instructions will load OP data into the OWR, whereas a branch will load IF data into the IDR.
- Since the D cycle of the target lines up with the X cycle of the branch, there's a 3 cycle *branch penalty* incurred on taken branches.
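
The 3 cycle figure falls out of the stage alignment. The sketch below is a toy model, not a simulator: it assumes one character per cycle, ignores interlocks, and the function names are invented for the illustration.

```python
PIPE = "DATBXW"


def taken_branch_penalty(pipe=PIPE, target_d_aligns_with="X"):
    """Cycles lost when the target's D cycle lines up with a stage of the branch."""
    # With no branch, the next instruction's D cycle comes 1 cycle after the
    # branch's D cycle. On a taken branch it slips to the aligned stage.
    return pipe.index(target_d_aligns_with) - 1


def timing_chart(pipe=PIPE):
    """Return the branch and target flows as aligned strings, one column per cycle."""
    penalty = taken_branch_penalty(pipe)
    return pipe, " " * (1 + penalty) + pipe


branch, target = timing_chart()
print("Branch:", branch)
print("Target:", target)
print("Penalty:", taken_branch_penalty(), "cycles")
```

Printing the two flows shows the target's D directly under the branch's X.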

Branches and performance
- The machine cycle time can't be faster than the raw branch path delay (IDR → EAG → TAG/TLB/Cache → IDR) divided by the number of cycles in this path.
  * SONA divides this over 4 cycles - DATB.
  * 5890 divided this over 3 cycles - DAB.
  * 580 divided it over 2 cycles - GB.
- the down side of having more cycles is:
  * ________________________________
  * ________________________________
  * ________________________________
I-unit
Basic Blocks

• Instruction Fetch
  - Maintains a queue of instructions.
  - Uses queue to keep ____ filled.

• Instruction Data Register (IDR)
  - Primary D-cycle platform.
  - Holds 4 bytes of the current instruction.

• Effective Address Generation (EAG)
  - Adds index, base, and displacement to generate the operand effective address.

• Register Array (RA)
  - A variety of registers, including the GPRs.

• Timer Complex
  - Includes a Time of Day clock and facilities to count time intervals (for time slicing).

• Control Store
  - Generates control points for the pipe.
  - Combination of µcode and hard-wired control.

• Interlock Analysis

• Process Control
  - State machine to control process switching (interrupts).

• Non I-unit stuff
  - OP cache (a.k.a. buffer) in S-unit.
  - E-unit, including OWR and RR.
I-fetch Data Paths

Fetch Data Registers

Instruction Buffers

Instruction Data Register

AMDAHL INTERNAL USE ONLY

Rev. 1.591
I-fetch Data Paths

I-fetch charter

Function of paths (excluding FDRs)

- IF cache:
  - Load IDR
  - Local TBO JB1

- IB0:
  -
  - IB byte queue m1, ID1 $h_{15..1}$ Q

- IB1:
  - Dephush IF1
  - To 2 byte IB1 from B1K2 of ID1
  - act as c68k

• Note:
  - The IDR is only 4 bytes. For 6 byte instructions, we start out with the first 4 bytes, which are enough to generate the first address. Once this is done, the third HW overwrites the 2nd HW, allowing us to then generate the second address.
I-fetch Data Paths (cont.)

Branch Processing
- Instructions that set the CC will do so in 1 of 3 cycles:
  * Early Setters: X
  * Normal Setters: W
  * Late Setters: Z
- Subsequent branch instructions can't make the branch decision (what to load into the IDR) until the CC has been set. Thus, the branch penalty is increased for Normal and Late CC setters.

<table>
<thead>
<tr>
<th>CC Setter Timings</th>
</tr>
</thead>
<tbody>
<tr>
<td>Setting Instr</td>
</tr>
<tr>
<td>Early CC Setter</td>
</tr>
<tr>
<td>Normal CC Setter</td>
</tr>
<tr>
<td>Late CC Setter</td>
</tr>
<tr>
<td>Branch</td>
</tr>
<tr>
<td>SU Flow</td>
</tr>
<tr>
<td>Target (Early CC)</td>
</tr>
<tr>
<td>Target (Norm CC)</td>
</tr>
<tr>
<td>Target (Late CC)</td>
</tr>
</tbody>
</table>

- Since the S-unit is free-running, it may return the data before the branch decision has been resolved. Until then, the data needs to be stored somewhere.
  * I-fetch provides Fetch Data Registers (16 bytes each) to hold target instruction data until the branch is resolved.
  * Two FDRs provide enough for worst case (late CC setter followed by 3 branches).

<table>
<thead>
<tr>
<th>Multiple Branches</th>
</tr>
</thead>
<tbody>
<tr>
<td>Setting Instr</td>
</tr>
<tr>
<td>CC SET</td>
</tr>
<tr>
<td>BRANCH 1</td>
</tr>
<tr>
<td>SU Flow</td>
</tr>
<tr>
<td>FDR 0</td>
</tr>
<tr>
<td>BRANCH 2</td>
</tr>
<tr>
<td>SU Flow</td>
</tr>
<tr>
<td>FDR 1</td>
</tr>
<tr>
<td>BRANCH 3</td>
</tr>
<tr>
<td>SU Flow</td>
</tr>
</tbody>
</table>
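[Block diagram. Non-taken Branches feed the Target Hedge Registers. The Current Trgt Address addresses the BAT, which supplies the Next Taken Trgt Address (predicted); that address goes to IF to fetch the Predicted Trgt Data. Recursion: the predicted address is fed back into the BAT to obtain the following Next Taken Trgt Address and its Predicted Trgt Data, and so on.]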
Branch Address Table Concept

Goal: Whenever a branch is encountered, it takes a while to fetch the target data and resolve the branch decision. In the meantime you’d like to keep the pipe busy. In the current design IF guesses that the branch will be __________ and fills these pipeflows with _________________.

The goal is to improve this guess. To do so you need to:

1. Have some way to predict which branches will be taken, then fill the pipe with the target stream following these branches. This requires that you …

2. Prefetch the target data so it’s ready for execution. Since it takes a while to fetch data from the cache, the prediction needs to be made well ahead of time so you have a chance to do this prefetching.

Implementation:

- The Branch Address Table is addressed by the Target Address of taken branches.
- The BAT contains the Target Address of the next taken branch. It also contains a count of how many branches are not taken before the next taken branch. Both of these fields are written into the BAT whenever a branch is first taken (i.e. –predicted branch).
- In words: the last time I branched to this location, the next taken branch was x branches later, and it branched to location y.
- Using the Predicted Target Address, prefetch the target data from the IF cache and keep it handy.
- Keep track of non-taken branches. When you get to the one that should be taken, let the next pipeflow use the prefetched target data instead of sequential data.
- The branch still needs to be processed and the branch decision checked to verify that the prediction was correct. Similarly, the generated target address is compared with the predicted address to make sure the address is correct.

Hedge Registers:

- On branches that are predicted to be not-taken, the branch flow still fetches the data and puts it in a Target Hedge Register, pending resolution of the branch. If it ends up being taken after all, this data can be loaded into the IDR to start up the next stream. Target Hedge Registers function just like _____ do in the current design.
- On branches that are predicted to be taken, save whatever data you have queued up for the sequential stream in a Sequential Hedge Register. That way if the prediction is wrong you can quickly restart the sequential stream.
- In either case, if you predict wrong you have to cancel the flows that got fired up. This is the same as it is today.
- This structure allows recursion. The Predicted Next Target Address can be used to address the BAT to get the follow-on Target Address, and so on. Each of these follow-on addresses can be sent to the IF cache to get the corresponding data, thereby building up a queue of target data (along with the associated addresses and NTCs).
- The current plan is 1 level of recursion (i.e. fetch next 2 targets) for the data and 3 levels of recursion for the address and the Not Taken Count.
**Branch Target Buffer**
*(obsolete)*

**Goal:** For each branch decision, correctly guess what the decision will be and have the data ready in time for the first D-cycle after the branch.

**Approach:** Keep track of taken branches and guess that they'll be taken again.

**Implementation:**
- Keys off of predecessor instruction address (instr. before branch) for timing reasons.
- In words: *Last time I was at this instruction address, the next instruction was a taken branch, so I'll assume that's what is going to happen this time.*
- The BTB saves the instruction data fetched from the previous time around, and loads it into the IDR to start processing.
- A "complete" BTB would have an entry for __________________________. Instead, only a portion of this conceptually huge address space is saved using standard caching techniques.
  * 256 sets x 2 associativities.
  * Addressed by low order bits of predecessor instruction address.
  * Remaining bits stored in TAGs and matched against.
- The "data" includes the branch data (i.e. target instruction) to be loaded into the IDR, plus the target address, which has two uses.
  - The predicted target address is compared with the calculated target address (from EAG) to provide early detection of an incorrect prediction.
  - It provides an early copy of the target address to access the BTB, as the target instruction could be a predecessor itself.
- This data can then be loaded into the IDR the cycle after the branch D-cycle.
* The branch processing continues as usual to allow verification of the BTB data:
  - verify that it is, indeed, a branch.
  - verify that the branch is taken.
  - check that the branch address is the same as the predicted address.
  - compare the data fetched by the branch flow with the data taken from the BTB.
* If any of these checks detects a problem, the pipeflows spawned from the BTB data are cancelled and the IDR is re-loaded with the correct instruction.

<table>
<thead>
<tr>
<th>Branch Target Buffer Timing</th>
</tr>
</thead>
<tbody>
<tr>
<td>Predecessor Instr D A T B X W</td>
</tr>
<tr>
<td>Branch D A T B X W</td>
</tr>
<tr>
<td>SU Target Fetch P A T B R</td>
</tr>
<tr>
<td>Branch address chk</td>
</tr>
<tr>
<td>Branch decode chk</td>
</tr>
<tr>
<td>Branch decision chk</td>
</tr>
<tr>
<td>Data mismatch chk</td>
</tr>
<tr>
<td>1st predicted instr D A T B X W</td>
</tr>
</tbody>
</table>
Branch Address Table Design

BAT Design

- 4K buffer, 2-way set-associative
- Addressed by Target Address 20:30
- TAGs include:
  * Target Address 12:19
  * Domain #
  * Guest/Host bit

- Access cycles include:
  * ab - address BAT cycle
  * br - BAT read cycle. On writes this becomes bw.

- Contents include a prediction (based on last time around) for:
  * Next Target Address 1:30
  * Non-taken count 0:3
    - Number of non-taken branches before next taken branch.
    - A count of F means invalid. This saves having a valid bit.

- The BAT can be accessed recursively in consecutive cycles. If this recursion gets interrupted it can be restarted from the LBRR (Last BAT Recursive Read). Predicted addresses can be saved in TGT1:2 and LBRR, with a 4th address on the BAT outputs.
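
A minimal software model of this organization is sketched below. It is not the hardware design: it assumes IBM bit numbering (bit 0 is the most significant bit of a 32-bit address), stores the full next target address rather than bits 1:30, and uses a hypothetical alternating replacement pointer, since the notes do not describe the replacement policy. The class and method names are invented.

```python
INVALID_COUNT = 0xF   # from the notes: a count of F means invalid


def bits(value, first, last, width=32):
    """Extract bits first:last, numbering bit 0 as the most significant bit."""
    return (value >> (width - 1 - last)) & ((1 << (last - first + 1)) - 1)


class BranchAddressTable:
    SETS = 1 << 11    # Target Address 20:30 gives 11 index bits
    WAYS = 2          # 2048 sets x 2 ways = 4K entries

    def __init__(self):
        # Each entry is (tag, next_target, not_taken_count).
        self.sets = [[(None, 0, INVALID_COUNT)] * self.WAYS
                     for _ in range(self.SETS)]
        self.victim = [0] * self.SETS   # hypothetical replacement pointer

    @staticmethod
    def _tag(target, domain, guest):
        return (bits(target, 12, 19), domain, guest)

    def read(self, target, domain=0, guest=False):
        """Return (next_target, not_taken_count), or None on a miss."""
        tag = self._tag(target, domain, guest)
        for entry_tag, next_target, count in self.sets[bits(target, 20, 30)]:
            if entry_tag == tag and count != INVALID_COUNT:
                return next_target, count
        return None

    def write(self, target, next_target, count, domain=0, guest=False):
        index = bits(target, 20, 30)
        tag = self._tag(target, domain, guest)
        ways = self.sets[index]
        for way, (entry_tag, _, _) in enumerate(ways):
            if entry_tag == tag:
                break                       # update the matching entry
        else:
            way = self.victim[index]        # otherwise replace a victim
            self.victim[index] ^= 1
        ways[way] = (tag, next_target, count)

    def recurse(self, target, depth, **kw):
        """Follow predictions for up to `depth` consecutive lookups."""
        chain = []
        for _ in range(depth):
            hit = self.read(target, **kw)
            if hit is None:
                break
            chain.append(hit)
            target = hit[0]
        return chain


bat = BranchAddressTable()
bat.write(0x1000, next_target=0x2000, count=2)
bat.write(0x2000, next_target=0x3000, count=0)
print(bat.recurse(0x1000, depth=3))
```

`recurse` mirrors the recursive access described above: each predicted target address is used to address the table again until a lookup misses.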

IF Data Paths

- TGT1:2 AR are each sent down the IF pipe to prefetch target data, which is saved in their respective IB0s.
- In addition, the second 16 bytes for TGT1 are prefetched and stored in TGT1 IB1.
- When the predicted branch is encountered, the data from the corresponding TGT IB0 is loaded into the IDR/IB0 and processing commences on it. For TGT1 this can be further replenished from TGT1 IB1.
- Meanwhile, the top 22 unused bytes of the IDR/IB0 and SEQ IB1 are saved in a Sequential Hedge Register, pending verification that the branch is indeed taken.
- Sometimes the new target stream will immediately encounter another branch which is predicted to be taken. This may occur before the original branch is resolved. To handle this a second Sequential Hedge register is provided to save the just loaded IDR/IB0 contents. Once the branch decisions are resolved these hedges will either be cancelled or one of them will be re-loaded into the IDR/IB0.
- The IDR and IB0 from the current design are combined into one 22-byte register. The top bytes are used as the IDR, and the whole thing is viewed as IB0.
  - For Zero Cycle Branches you always want to have 8 bytes or more in the IDR. Thus, when the count falls to 8 you know you want to load in 16 bytes from IB1 next cycle. Meanwhile, at least 2 of the current bytes will be consumed by the pipe, so room is needed for 6+16=22 bytes.
  - Since the Sequential Hedge Registers are fed by the same shifter bus as the IDR/IB0, making them also 22 bytes each simplifies things.
Zero Cycle Branch

Basic Concept

• For certain cases of certain branch instructions (BC, BCR), execute the branches in parallel with the previous instruction.

Basic Implementation

• In addition to the original pipe (now called the E-pipe since it can use the E-unit), a second parallel pipeline (I-pipe) is created. This pipeline is dedicated to executing BC/BCR and has minimal facilities. Specifically, it has:
  - Instruction Address Generation
    * Generates the target address
    * Only has a 2 port adder. Either the base or index must be zero to do ZCB on a BC.
    * Has a dedicated selector to read out the base/index from the EAG GPRs.
    * No EGI bypass provided.
  - SU IF Interface
    * Interface to the IF pipe to fetch the target data.
    * SU IF pipe now has a 4 entry TLB. On TLB match (95% of IF's) this allows the target fetch to complete w/o the OP pipeline.
    * On TLB miss (done in the A-cycle) a traditional IF TLB validate is initiated a cycle later than normal, requiring the OP pipe.
  - Staging of address, opcode, misc. stuff
    * Used for exceptions, PER, Address Compare, STIS.

• An Extra (or Eligibility) cycle is added to the front end of the pipe.
  - The EIDR is examined to determine if the first two instructions can be "paired". A number of conditions must be met. A partial list includes:
    * The second instruction is a BC or BCR. If a BC, one of X or B must be zero.
    * Certain first instructions can't be paired with a BC.
    * No EGI.
  - This extra cycle also allows the IDR to be latched before being distributed to lots of DIDR copies. In the current design this distribution is done directly from the IF cache, creating some physical design problems.

• If pairing occurs (between the first instruction and the subsequent BC/BCR), the two instructions will proceed down their respective pipes in lockstep with each other. That is, if either pipe interlocks, they both will interlock.

• If all goes well (e.g. BC/BCR is eligible, IF TLB match) the BC/BCR can be executed using only the I-pipe, effectively costing zero cycles.
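
The partial pairing conditions listed above can be written as a predicate. This is a sketch covering only those listed conditions, not the full eligibility logic. It relies on the System/370 opcodes for BC (47) and BCR (07) and on the RX field layout; the `unpairable` set and the `egi` flag are hypothetical inputs standing in for checks the notes only mention.

```python
BC, BCR = 0x47, 0x07   # Branch on Condition, RX and RR forms


def zcb_eligible(first_opcode, second, unpairable=frozenset(), egi=False):
    """Partial pairing check for a first instruction and a following BC/BCR.

    `second` holds the raw bytes of the second instruction; `unpairable` is a
    hypothetical set of first-instruction opcodes that can't be paired.
    """
    if egi or first_opcode in unpairable:
        return False
    opcode = second[0]
    if opcode == BCR:
        return True
    if opcode == BC:
        x2 = second[1] & 0xF
        b2 = second[2] >> 4
        # The I-pipe adder has only 2 ports, so one of X or B must be zero.
        return x2 == 0 or b2 == 0
    return False


print(zcb_eligible(0x1A, bytes.fromhex("47F01010")))  # BC with X2 = 0
print(zcb_eligible(0x1A, bytes.fromhex("47F21010")))  # BC with X2 and B2 both in use
```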
Successful Branch Prediction w/ZCB

[Timing diagram. Rows: Previous Branch; BAT Lookup; TGT1 IB0 Prefetch; TGT1 IB0; TGT1 IB1 Prefetch; TGT1 IB1; Recursive BAT Lookup; TGT2 IB0 Prefetch; TGT2 IB0; Misc. Flows; Misc. (E-pipe); BC (I-pipe); TGT1 -> IDR/IB0:1; IDR/IB0:1 -> SEQ HDG1; Branch decision; Target (E-pipe); Misc. target stream flows. Cycle strings shown: EDATBXWZ (pipe flows), ab br (BAT access), PATBR (S-unit flow).]
EAG Data Paths

• Effective Address Generation
  - EAG complex maintains a copy of the GPRs.
  - X and B fields used to select index and base GPRs, respectively.
  - Index and base GPRs fed to a 3 port adder, along with the displacement field.
  - Result is put into the AEAR which sends it to the S-unit.
  - Note the path from the RR to update the GPRs at the end of the W-cycle.

• EGI interlock
  - If a prior instruction is modifying the B or X GPRs, EAG must wait for it to be updated from the Result Register.
  - Can be a substantial performance penalty.

• Bypasses can buy some of this back.
  - RR Bypass
    * RR data sent directly into the adder at same time it's written to GPR.
    * Saves 1 cycle over no bypass.
  - OWR Bypass
    * OWR data bypassed into the adder.
    * Only works if bypassing from _______ instructions.
    * Saves 2 cycles over no bypass.
  - EAG Result Bypass
    * EAG complex duplicates ALU calculations being done in E-unit.
    * Done on NR, AR, ALR, LA, SLL, SLA (shift amounts of 0, 2, or 3).
    * Saves up to 5 cycles over no bypass.

<table>
<thead>
<tr>
<th>EGI Timings</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPR modifying inst.</td>
</tr>
<tr>
<td>W/O Bypass</td>
</tr>
<tr>
<td>RR Bypass</td>
</tr>
<tr>
<td>OWR Bypass</td>
</tr>
<tr>
<td>EAG Result Bypass</td>
</tr>
</tbody>
</table>
Register Array

Register Array - per Sequoia architecture
- Includes all architectural registers (GPRs, Control Registers, Timer Registers, etc.) except Floating Point and Vector Registers.
  - Superset of IBM defined registers.
- Defined to be a 256x32 bit array with each register in a defined location.
  - IBM defines as a variety of registers. Sequoia consolidates them all into one entity.

Register Array - as implemented
- 512x32 array implemented in RAM (256 scratch registers provided for μcode use).
- All architectural registers implemented in 1 of 3 types:
  - RAM Register: the only copy is in the RAM array.
  - LSI copies: the register is in RAM, but LSI copies are kept elsewhere.
  - Live registers: the only copy that's accurately maintained is in LSI. The RAM location is reserved but not used.
- RAM array capabilities (requirements) include:
  - concurrent read and a write to ____________.
  - can write even and odd registers in parallel for ____________.
  - read any two registers in parallel for ____________.

Implementation scheme
- Concurrent Read/Write:
  - 2 banks of RAM (A and B) implemented, each large enough to hold all the registers.
  - For a given location, only 1 bank contains the current, up to date copy.
  - A 1 bit LSI TAG, one per location, indicates which bank is up to date for that location.
- Writing register pair:
  - Even and Odd GPRs are put in separate RAMs.
- Reading 2 registers:
  - Duplicate this whole scheme to provide a second read port.
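
The two-bank scheme for concurrent read/write can be modelled as below. This is an interpretation, not the implemented logic: it assumes each bank can do one access per cycle and that the write is steered to whichever bank the read is not using. The class and method names are invented.

```python
class DualBankRegisterArray:
    """Two single-ported banks plus a 1-bit tag per location."""

    def __init__(self, size=512):
        self.banks = [[0] * size, [0] * size]   # bank A = 0, bank B = 1
        self.tag = [0] * size                   # which bank holds the current copy

    def read(self, loc):
        return self.banks[self.tag[loc]][loc]

    def cycle(self, read_loc, write_loc, value):
        """One cycle that performs a read and a write concurrently."""
        read_bank = self.tag[read_loc]
        result = self.banks[read_bank][read_loc]
        # Assumption: the write goes to the bank the read is not using, and
        # the tag then marks that bank as current for the written location.
        write_bank = read_bank ^ 1
        self.banks[write_bank][write_loc] = value
        self.tag[write_loc] = write_bank
        return result


ra = DualBankRegisterArray()
ra.cycle(read_loc=0, write_loc=5, value=111)
print(ra.cycle(read_loc=5, write_loc=5, value=222), ra.read(5))
```

Reading and writing the same location in one cycle returns the old value, and the tag flip makes the new value visible afterwards.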
Timer Complex

Time Of Day clock
- Always running. Keeps track of "absolute" time.
- Architecture requires several versions.
  * Current Domain
  * Macrocode
  * Guest running in current domain
- A Macrocode TOD is maintained in a register. Epoch Differences provide the offsets for the other versions.

Comparators
- "Alarm clocks". The TOD is compared with these values and a rupt is generated when the TOD exceeds them.
- Two versions, Domain and Guest.

CPU Timer
- Only counts CPU time (i.e. when CPU is executing).
- Counts down to zero, then sends a rupt. Useful for time slicing.

Implementation
- Registers hold the current value of the various timers.
- They're multiplexed through an incrementor/decrementor to get updated once per 32 ns.
- 32 ns tick from oscillator card provides timing.
- The TOD is adjusted for Epoch offset, then compared with Comparator register to see if a rupt is needed.
- All registers loaded from the RR, and can load the OWR (via the __________).
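
A rough behavioural sketch of the timers follows. It works in units of one 32 ns tick rather than real TOD bit positions, assumes the Epoch Difference is simply added to the Macrocode TOD, and uses invented names throughout.

```python
TICK_NS = 32   # from the notes: timers are updated once per 32 ns


class TimerComplex:
    def __init__(self, tod=0, epoch_difference=0, comparator=None, cpu_timer=None):
        self.tod = tod                            # Macrocode TOD, in ticks
        self.epoch_difference = epoch_difference  # offset for another TOD version
        self.comparator = comparator              # "alarm clock", in ticks
        self.cpu_timer = cpu_timer                # counts down while the CPU executes

    def tick(self, cpu_executing=True):
        """Advance one 32 ns tick and return the set of pending rupts."""
        rupts = set()
        self.tod += 1                             # the TOD is always running
        # Assumption: the epoch difference is added to the Macrocode TOD.
        adjusted = self.tod + self.epoch_difference
        if self.comparator is not None and adjusted > self.comparator:
            rupts.add("clock comparator")
        if self.cpu_timer is not None and cpu_executing:
            self.cpu_timer -= 1                   # only counts CPU time
            if self.cpu_timer <= 0:
                rupts.add("cpu timer")
        return rupts


timers = TimerComplex(tod=0, epoch_difference=10, comparator=12, cpu_timer=2)
for executing in (True, False, True):
    print(sorted(timers.tick(cpu_executing=executing)))
```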
Control Store

First flow of an instruction algorithm
- D-cycle control points are hard-wired.
- A-cycle and subsequent control points come from FACS (First A-cycle Control Store).
  - 256x96 RAM (including parity).
  - Addressed by byte0 (opcode byte) of IDR.

Second and subsequent flows
- D-cycle control points come from DCS.
  - Two, 1Kx80 (including parity) banks.
  - Branches always go to the other bank. No branch penalty.
  - 2 deep ________________
  - Starting address is ________.
    * Always starts in bank B.
  - For 2-byte opcodes, the opcode is remapped into 10 bits for the CSADR.
    * _______ flow is the first unique control store access for a 2-byte opcode.
    * Always in bank A.

- A-cycle and subsequent control points come from MACS.
  - Same address structure as DCS, just different RAMs and control points.
Decode Opcode

- Decode Opcode goes down the pipe with each flow.
  - Hardwired control points are generated by decoding the Decode Opcode.
  - Sparse, stable control points are candidates for hardwiring.

- The Decode Opcode has two fields:

1. Decode Opcode
   - 8 bits.
   - original source is ____________.
   - Other sources include:
     * Value from previous flow. Multi-flow algs usually keep the same value throughout.
     * D1:2 + 4 bit D-cycle FACS field for __________. Unique DCD OPCD on ______.
     * 2 deep stack for __________.
     * DCS field used to change to a new value when you run out of Function Counts.

2. Function Count
   - Used to differentiate flows in a multiple flow algorithm (all have the same Dcd Opcd).
   - Word "Count" is a misnomer as this field doesn't just increment.
   - Can be reused if different instruction flows have same hardwired control points.
   - Starts at 8 on first flow.
   - Sourced from ________ on subsequent flows.
Op Exceptions

• For 1-byte opcodes, comes from ______.

• For 2-byte opcodes, use OP EX RAM.
  - 256x16 RAM.
  - Addressed by the second byte of the opcode.
  - Each column belongs to one 2-byte opcode family (same first byte).
  - FACS field selects which column to use.
  - Selected bit indicates opcode validity for that 2-byte opcode.

• Final selection chooses between the OP EX RAM (for 2-byte opcodes) and the FACS (for 1-byte opcodes).
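
The lookup described above amounts to a row select followed by a column select. The sketch below assumes a set bit means "valid" and that column 0 is the least significant bit. The `facs_entry` dictionary and its keys are hypothetical stand-ins for the FACS fields.

```python
class OpExRam:
    ROWS, COLUMNS = 256, 16

    def __init__(self):
        self.rows = [0] * self.ROWS    # one 16-bit word per second opcode byte

    def mark_valid(self, column, second_byte):
        self.rows[second_byte] |= 1 << column

    def is_valid(self, column, second_byte):
        # Row is addressed by the second opcode byte; the column picks the family.
        return bool((self.rows[second_byte] >> column) & 1)


def op_exception(opcode, facs_entry, op_ex_ram):
    """Return True if the opcode should raise an operation exception."""
    if facs_entry["two_byte"]:
        return not op_ex_ram.is_valid(facs_entry["column"], opcode[1])
    return facs_entry["op_exception"]   # 1-byte opcodes: FACS decides


ram = OpExRam()
ram.mark_valid(column=3, second_byte=0x22)
family = {"two_byte": True, "column": 3}   # hypothetical FACS entry for one family
print(op_exception(bytes([0xB2, 0x22]), family, ram))
print(op_exception(bytes([0xB2, 0x23]), family, ram))
```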
Op Code Families

[Figure: opcode family array; labels: Byte 1, 256.]
Interlocks

General - all stages
- Inhibit pipe - signal used to freeze pipe state (e.g. for error recovery).
- Pipeline interlock - the downstream stage is valid and is interlocked.

D-cycle Interlocks
- Execute-Generate Interlock - prior instruction is modifying a GPR needed for EAG.
- Programmed Delay Interlock - μcode can force an interlock.
- Overlap Interlock - if overlap turned off, wait for pipe to clear.
- OWR Interlock - OWR EGI bypass shares a bus with RR. If RR is using it, can't bypass.
- I-fetch TLB Validate Interlock - waiting for S-unit access to do an IF TLB validate.
- Domain Interlock - like EGI but for some Sequoia registers.
- D-cycle Control Store Parity Error
- Access Register Interlock - deals with Access Registers, which we're ignoring.

A-cycle Interlocks
- Operand Priority Interlock - waiting for priority into the S-unit.
- A-cycle Control Store Parity Error
- ALB Interlock - deals with ALB, which we're ignoring.
- A eXception Valid Interlock - special interlock to fix a bug.

T-cycle Interlocks
- None.

B-cycle Interlocks
- BALRUS Interlock - BAL and RUS use CC as data. Need to get it set then pass to OWR.

X-cycle Interlocks
- Fetch Data Interlock - waiting for data from S-unit.
- E-Unit Busy Interlock - E-unit busy processing data.
- Condition Code Interlock - waiting for CC setter to allow branch decision.
- Syscom Interlock

W-cycle Interlocks
- None.
Process Control

Process switching per the POO

- 6 classes of interrupts:
  - External: Timer rupts, plus some miscellaneous rupts.
  - Program: detected during instruction execution.
    * e.g. overflow, translation exception, operation exception (illegal opcode)
  - Machine check
  - Supervisor Call: this is an instruction.
  - I/O: initiated by conditions or events in the I/O subsystem.
  - Restart: used by console or another CPU.

- Upon taking the rupt:
  1. Stop processing the current instruction stream.
  2. Store the current PSW as Old PSW into a fixed location in page 0.
  3. Store an interrupt code describing the interrupt (into a fixed location).
  4. Load the New PSW from a fixed location and start processing.
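
The four steps can be sketched as a PSW swap over a byte array. The page-0 offsets in the table are illustrative assumptions and should be checked against the POO. The 2-byte interrupt code width and the function name are assumptions as well.

```python
# Assumption: illustrative page-0 offsets per interrupt class, given as
# (Old PSW, New PSW, interrupt code). Check the POO for the real values.
FIXED_LOCATIONS = {
    "external": (0x18, 0x58, 0x86),
    "supervisor call": (0x20, 0x60, 0x8A),
    "program": (0x28, 0x68, 0x8E),
}


def take_interrupt(storage, current_psw, rupt_class, code):
    """Swap PSWs for one interrupt and return the new current PSW (8 bytes)."""
    old_at, new_at, code_at = FIXED_LOCATIONS[rupt_class]
    # 1. The current instruction stream stops (the caller stops issuing).
    # 2. Store the current PSW as the Old PSW.
    storage[old_at:old_at + 8] = current_psw
    # 3. Store the interrupt code describing the interrupt.
    storage[code_at:code_at + 2] = code.to_bytes(2, "big")
    # 4. Load the New PSW; processing resumes from it.
    return bytes(storage[new_at:new_at + 8])


storage = bytearray(4096)                                   # page 0
storage[0x68:0x70] = bytes.fromhex("0000000000002000")      # New PSW set up by software
old_psw = bytes.fromhex("0000000000001004")
new_psw = take_interrupt(storage, old_psw, "program", code=0x0001)
print(new_psw.hex(), bytes(storage[0x28:0x30]).hex())
```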
**Process Control**

**State Machine - Normal Process Switch**

**End Process State**
- wait while μcode runs

**Set Proc St.**

**Process State**
- Process instr.
- wait for a rupt

**rupt**

**Restore State 2**
- Fetch 1st μcode
- set D valid

**STQ empty**

**Restore State 1**
- cncl, inhib pipe
- flush St Queue
- init. CSADR
Process Control (cont.)

Process Switch implementation

Process Control State Machine:

- Process state - normal state for processing instructions.
- RS1
  - Cancels/inhibits the pipe and waits for the Store Queue to flush.
  - Starts reconstructing old PSW from ZAR/WAR.
  - Sets up CSAR address (to be loaded next cycle).
- RS2
  - Finishes reconstructing the old PSW (clocks TOAR w/address).
  - Sets up D-valid (to be loaded next cycle).
- End process state
  - Waits while µcode does rupt processing.

Rupt handling µcode:

- Stores rupt code info, as needed.
- Does any other special handling required (e.g. I/O rupts).
- Loads New PSW, and fetches target instruction.
- Asserts SET PROC STATE, kicking Process Control back into Process State.
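
The normal process-switch loop can be written as a small transition function. This is a simplified sketch: it covers only the four states named above, assumes RS2 lasts a single cycle, and the signal names are invented labels for the conditions in the state diagram.

```python
from enum import Enum


class State(Enum):
    PROCESS = "Process State"
    RS1 = "Restore State 1"
    RS2 = "Restore State 2"
    END_PROCESS = "End Process State"


def next_state(state, rupt=False, stq_empty=False, set_proc_state=False):
    """One step of the normal process-switch loop."""
    if state is State.PROCESS:
        return State.RS1 if rupt else State.PROCESS
    if state is State.RS1:
        # Pipe is cancelled/inhibited while the Store Queue flushes.
        return State.RS2 if stq_empty else State.RS1
    if state is State.RS2:
        # Assumption: RS2 takes one cycle, then µcode takes over.
        return State.END_PROCESS
    # End Process State: wait while µcode does the rupt processing.
    return State.PROCESS if set_proc_state else State.END_PROCESS


state = State.PROCESS
for signals in ({"rupt": True}, {}, {"stq_empty": True}, {}, {}, {"set_proc_state": True}):
    state = next_state(state, **signals)
    print(state.value)
```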

<table>
<thead>
<tr>
<th>Process Switch Timing Example</th>
</tr>
</thead>
<tbody>
<tr>
<td>Instr. taking rupt</td>
</tr>
<tr>
<td>Next instruction</td>
</tr>
<tr>
<td>Rupt Taken</td>
</tr>
<tr>
<td>Inhibit Pipe</td>
</tr>
<tr>
<td>Cancel Pipe</td>
</tr>
<tr>
<td>Process Control State</td>
</tr>
<tr>
<td>WAR/ZAR + ILC → OIR</td>
</tr>
<tr>
<td>OIR → TOAR</td>
</tr>
<tr>
<td>µcode rupt processing</td>
</tr>
<tr>
<td>Misc.</td>
</tr>
<tr>
<td>Fetch Target Inst.</td>
</tr>
<tr>
<td>SET PROC STATE</td>
</tr>
<tr>
<td>Target Instr. Flow</td>
</tr>
</tbody>
</table>


AMDAHL INTERNAL USE ONLY
Process Control (cont)

- **Start/Stop**
  
  **SVP**
  - Issue STOP command (or after a reset).
  
  **State Machine**
  - Cancel pipe and wait for Start from SVP. (*Stop State*)
  
  **SVP**
  - Issue START command.
  
  **State Machine**
  - Set up CSADR and go to RS2. (*Start State*)
  - From there on it looks like normal rupt handling.

- **SVP Op loop**
  
  **SVP** (*Stop State*)
  - Scan in an instruction or pipeflow w/clocks off.
  - Turn clocks on.
  
  **State Machine**
  - Turn on D valid and start execution. (*Load State*)
  - Wait until done, then flush Store queue. (*SVP Execute, Clear States*)
  - Return to STOP State. (*SVP Clear State*)

- **Error handling loop**
  
  **State Machine**
  - Freeze pipe state and E-unit. (*Error Idle 1*)
  - Wait for clocks to go off using stall counter. (*Error Idle 1*)
  
  **SVP**
  - S-code repairs damage, then turns clocks back on.
  
  **State Machine**
  - Flush Store Queue. (*Error Clear State*)
  - Request SVP Aid, if needed. (*Error Idle 2*)
  
  **SVP**
  - With I/E clocks off, S-code assembles MCIC for Macrocode, based on log analysis.
  
  **State Machine**
  - Reconstruct Old PSW and go to RS2. (*Error RS1*)
  - If not past retry point, refetch from Old PSW. Otherwise, load CSAR to point to μcode to do rupt processing and fetch first μinstruction. (*RS2*)
  
  **μcode (if no retry)**
  - Store OLD PSW and MCIC.
  - Load New PSW and start processing.
E-unit
Sub-units

The E-unit is made up of 3 sub-units:

**Floating point**
- Handles floating point calculations, per the Floating Point chapter in the POO.

**Decimal**
- Handles decimal calculations, per the Decimal chapter in the POO.

**Fixed point**
- Handles everything else, especially the General Instruction chapter in the POO.

Each sub-unit has its own versions of the OWR and RR.
Fixed Point
Fixed Point Basic Blocks

Rev. 1, 5/91

[Figure: Fixed Point basic blocks. Labels: OP Buffer and IU GPRs; OWR (HI, LO); Adder, Multiplier, Divider, Logical Unit, Shifter; Result Register (to Op Buffer, IU GPRs).]

Fixed Point - Basic Blocks

- The fixed point contains 5 fairly independent blocks:
  - Adder/subtractor
  - Multiplier
  - Divider
  - Logical Unit
  - Shifter

- Data is in two's complement notation.
  - Halfword, word, and doubleword lengths.
\[ C_1 = G_2 + P_2 \cdot C_{in} \]
\[ C_0 = G_1 + P_1 \cdot G_2 + P_1 \cdot P_2 \cdot C_{in} \]
\[ G_{out} = G_0 + P_0 \cdot G_1 + P_0 \cdot P_1 \cdot G_2 \]
\[ P_{out} = P_0 \cdot P_1 \cdot P_2 \]
9-bit Carry Propagate Adder

EXAMPLE ONLY, NOT IN THE DESIGN!
- Used here to illustrate concepts.

• Full Adder
  - 1 per bit.
  - Sums $A_n$, $B_n$, and $C_{in}$.
  - Also calculates Propagate and Generate for each bit.
    * $P_n = A_n + B_n$ (i.e. a carry into this bit will cause a carry out of this bit).
    * $G_n = A_n \cdot B_n$ (i.e. a carry-out is generated, irrespective of the carry in).
    * Note that $P$ and $G$ are independent of $C_{in}$.

• Carry Propagate
  - Bits grouped by three. Each group has a Carry Propagate Element (my name).
  - Based on the $P_n$ and $G_n$ inputs from the 3 Full Adders, plus the $C_{in}$ to the group:
    * Calculates $C_{in}$ into the top 2 bits (the low order bit gets the group $C_{in}$).
  - Based on the $P_n$ and $G_n$ inputs only (not on $C_{in}$ to the group):
    * Calculates $P$ and $G$ for the 3 bits as a whole.
      P means the 3-bit group will propagate a carry coming in.
      G means the 3-bit group generates a carry-out on its own.
  - Can stack elements to make larger adders:
    * One 3-bit element groups together three lower 3-bit elements to form a 9-bit adder.
    * Elements don't have to be same size at each level. Could make a 12-bit adder with a 4 "bit" element at the highest level.
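
The 9-bit example can be sketched in software to check the equations. Like the adder itself, this is illustration only; the function names `cpe` and `add9` are made up here, and bit 0 is taken as the high-order bit to match the subscripts in the carry equations.

```python
def cpe(p, g, c_in):
    """3-input Carry Propagate Element; index 0 is the high-order input."""
    c2 = c_in                                          # low-order input gets the group carry-in
    c1 = g[2] | (p[2] & c_in)
    c0 = g[1] | (p[1] & g[2]) | (p[1] & p[2] & c_in)
    g_out = g[0] | (p[0] & g[1]) | (p[0] & p[1] & g[2])
    p_out = p[0] & p[1] & p[2]
    return (c0, c1, c2), p_out, g_out


def add9(a, b, c_in=0):
    """9-bit add built from 3-bit groups; returns (sum, carry_out)."""
    a_bits = [(a >> (8 - i)) & 1 for i in range(9)]    # bit 0 is most significant
    b_bits = [(b >> (8 - i)) & 1 for i in range(9)]
    p = [x | y for x, y in zip(a_bits, b_bits)]        # per-bit Propagate
    g = [x & y for x, y in zip(a_bits, b_bits)]        # per-bit Generate

    # Level 1: group P and G do not depend on the group carry-in.
    group_p, group_g = [], []
    for i in (0, 3, 6):
        _, gp, gg = cpe(p[i:i + 3], g[i:i + 3], 0)
        group_p.append(gp)
        group_g.append(gg)

    # Level 2: one element treats the three groups as three "bits".
    group_c_in, p9, g9 = cpe(group_p, group_g, c_in)
    carry_out = g9 | (p9 & c_in)

    # Back at level 1: each group turns its carry-in into per-bit carries.
    carries = []
    for k, i in enumerate((0, 3, 6)):
        bit_c_in, _, _ = cpe(p[i:i + 3], g[i:i + 3], group_c_in[k])
        carries.extend(bit_c_in)

    total = 0
    for x, y, c in zip(a_bits, b_bits, carries):
        total = (total << 1) | (x ^ y ^ c)             # full adder sum bit
    return total, carry_out
```

The same element is reused at both levels, which is the point of the stacking remark: only the meaning of its inputs changes (bit P/G at level 1, group P/G at level 2).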
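[Figure: Conditional Sum Adder, n = arbitrary number of bits. A and B feed two adders, one computing A + B (carry-in '0') and one computing A + B + 1 (carry-in '1'); Carry In selects which result becomes SUM.]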
Fixed Point CPA

Conditional Sum Adder

- For a group of bits, calculate the sum with and without the carry-in.
- Let the carry-in select the appropriate sum.
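
A minimal software sketch of the conditional sum idea follows. The function name and the 4-bit default group size are assumptions made for illustration; in hardware both sums of every group are formed in parallel, whereas this loop visits the groups one after another.

```python
def conditional_sum_add(a, b, c_in=0, width=32, group=4):
    """Add two width-bit values group by group; returns (sum, carry_out).

    width is assumed to be a multiple of group.
    """
    mask = (1 << group) - 1
    result, carry = 0, c_in
    for shift in range(0, width, group):
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        sum0 = x + y                          # sum computed as if carry-in were '0'
        sum1 = x + y + 1                      # sum computed as if carry-in were '1'
        chosen = sum1 if carry else sum0      # the real carry-in only selects
        result |= (chosen & mask) << shift
        carry = chosen >> group               # carry-out of this group
    return result, carry
```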

Fixed Point CPA

- Propagate Structure
  - 1st level groups bits by 8 to form byte P and G.
  - 2nd level bundles the 4 bytes to form byte \( C_{in} \)’s.

- Carry in structure
  - Conditional sum done at nibble level.
  - Byte \( C_{in} \)’s combined with bit P and G to generate nibble \( C_{in} \)’s.
    * Low order nibble gets byte \( C_{in} \) directly.
    * High order nibble gets byte \( C_{in} \) combined with 4 low order bit P/G’s.

- Adds 32 bits in 1 cycle.
## Multiply Algorithms

### 32 bits X 16 bits

**Standard "Shift and Add" Multiply Algorithm**

<table>
<thead>
<tr>
<th>B0</th>
<th>B1</th>
<th>B2</th>
<th>B3</th>
<th>B4</th>
<th>B5</th>
<th>B6</th>
<th>B7</th>
<th>B8</th>
<th>B9</th>
<th>B10</th>
<th>B11</th>
<th>B12</th>
<th>B13</th>
<th>B14</th>
<th>B15</th>
<th>times</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>A</td>
<td></td>
<td>1</td>
<td>A×2^1&lt;br&gt; 1</td>
<td>A×2^2&lt;br&gt; 1</td>
<td>A×2^3&lt;br&gt; 1</td>
<td></td>
<td>1</td>
<td></td>
<td>1</td>
<td></td>
<td>1</td>
<td></td>
<td>1</td>
<td>A×2^4&lt;br&gt; 1</td>
<td>A×2^5&lt;br&gt; 1</td>
</tr>
</tbody>
</table>

**Modified Booth's Multiply Algorithm**

<table>
<thead>
<tr>
<th>B0</th>
<th>B1</th>
<th>B2</th>
<th>B3</th>
<th>B4</th>
<th>B5</th>
<th>B6</th>
<th>B7</th>
<th>B8</th>
<th>B9</th>
<th>B10</th>
<th>B11</th>
<th>B12</th>
<th>B13</th>
<th>B14</th>
<th>B15</th>
<th>times</th>
</tr>
</thead>
<tbody>
<tr>
<td>-1</td>
<td>A</td>
<td></td>
<td>-2</td>
<td>1</td>
<td>1</td>
<td>A×2^1&lt;br&gt; 1</td>
<td>-2</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td></td>
<td>1</td>
<td></td>
<td>1</td>
<td>A×2^4&lt;br&gt; 1</td>
<td>A×2^5&lt;br&gt; 1</td>
</tr>
</tbody>
</table>
Fixed Point Multiply Algorithm

Standard "Shift and Add"

- As you traverse Multiplier from right to left:
  - Add in Multiplicand, if Multiplier bit is a 1.
  - Shift Multiplicand left 1 bit (i.e. multiply by 2).
  - Move on to next Multiplier bit.

Modified Booth's Algorithm

- Examine Multiplier bits in triplets, moving left 2 bits at a time (i.e. edge bits of triplet are shared with adjacent triplets).
- From most to least significant Multiplier bits, contribution is -2, 1, 1. (See table, p. 3-12).
- Possible values for a given row (i.e. given multiplicand shift amount) are ___________. All of these can be generated by _____________________.

<table>
<thead>
<tr>
<th>Triplet Value</th>
<th>Multiplicand Select</th>
</tr>
</thead>
<tbody>
<tr>
<td>000</td>
<td>0</td>
</tr>
<tr>
<td>001</td>
<td>1</td>
</tr>
<tr>
<td>010</td>
<td>1</td>
</tr>
<tr>
<td>011</td>
<td>2</td>
</tr>
<tr>
<td>100</td>
<td>-2</td>
</tr>
<tr>
<td>101</td>
<td>-1</td>
</tr>
<tr>
<td>110</td>
<td>-1</td>
</tr>
<tr>
<td>111</td>
<td>0</td>
</tr>
</tbody>
</table>
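
The triplet table can be exercised with a short sketch. It assumes a signed two's complement Multiplier and an implied 0 to the right of the low-order bit; the function names are illustrative. It produces one term per bit pair (8 terms for 16 bits) and does not model any extra term the hardware may add.

```python
# Multiplicand select for each triplet value, as in the table above.
SELECT = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
          0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}


def booth_terms(multiplier, width=16):
    """Recode a signed Multiplier into Booth selects, low-order triplet first."""
    m = (multiplier & ((1 << width) - 1)) << 1       # append the implied 0 on the right
    return [SELECT[(m >> i) & 0b111] for i in range(0, width, 2)]


def booth_multiply(multiplicand, multiplier, width=16):
    """Sum the selected Multiplicand multiples, each shifted left 2 bits more."""
    terms = booth_terms(multiplier, width)
    return sum(sel * multiplicand * 4 ** k for k, sel in enumerate(terms))
```

For example, a Multiplier of 3 recodes to selects of -1 (weight 1) and +1 (weight 4), and 4 - 1 = 3.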
Multiplier Implementation

POO Definition
• Multiplies two 32-bit operands to form a 64-bit result to be stored into a register pair.

Implementation
• Breaks 32x32=64 bit multiply down into two 32x16=48 bit multiplies.
• Current Multiplier half (16 bits) is recoded per Booth's algorithm, then controls shifting and adding of the Multiplicand. Generates 9 terms.
• Carry Save Adder tree reduces nine 32 bit terms to two 48 bit terms, then adds in upper 32 bits from prior cycle in the last CSA level.
• Multiply Carry Propagate Adder sums the final two terms.
• Low 16 bits from first cycle are concatenated with 48 bits from 2nd cycle to form final 64 bit result.
• Takes ___ X-cycles.

 Carry Save Adder
• Technique for multiple operand addition.
• Basic element is a 3 input adder:
  - For each bit position it generates a Sum bit and a Carry bit.
  - Instead of propagating the Carry, it's shifted left 1 bit to form a new operand.
  - Output is two operands, a Sum and a Carry. Thus, the CSA reduces 3 operands to 2.
• By stacking these CSAs into a tree, multiple operands can be reduced to 2, without having to propagate any carries.
• Eventually, a CPA is needed to sum the final two terms. The CPA structure is:

![MCPA Propagate Structure](image-url)
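
The 3-to-2 reduction can be sketched as below. This is a behavioral model on non-negative Python integers, not the actual tree wiring; `csa` and `csa_tree_sum` are names chosen for the example, and the order in which operands are grouped is arbitrary.

```python
def csa(x, y, z):
    """Carry Save Adder: returns (sum, carry) with x + y + z == sum + carry."""
    s = x ^ y ^ z                                   # per-bit Sum, no carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1          # per-bit Carry, shifted left 1 bit
    return s, c


def csa_tree_sum(operands):
    """Reduce any number of operands to two with CSAs, then do one final add."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = csa(ops[0], ops[1], ops[2])          # 3 operands become 2
        ops = ops[3:] + [s, c]
    return sum(ops)                                 # stands in for the final CPA
```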
Fixed Point Divider

[Figure: Fixed Point Divider. Labels: Divisor (from OWR); Partial Remainder; Partial Quotient; Sign Bits; Quotient to RR; Remainder.]
Fixed Point Divider

**POO Requirements**
- Dividend - 64 bits
- Divisor - 32 bits
- Quotient - 32 bits
- Remainder - 32 bits

**Implementation**
1. Load so the Dividend is positive and the Divisor negative. (Paths not shown.)
2. Do trial subtractions (i.e. additions of negative) of 1 to 7 times Divisor from Dividend 0:33.
   - All 7 trial values are obtainable with shift, complement, and 1 addition. This is built into the adders.
   - Dividend bit 33 is aligned with Divisor bit 31. This allows the Dividend value to be up to 7 times the Divisor value.
3. Select "winning" result back into Dividend 0:30.
4. Left shift appropriate Partial Quotient bits into Dividend 31:63, using the room left by the 3 bits that participated in the subtractions.
5. When done, the Remainder is in the upper half of Dividend, and the Quotient is in the lower half.
6. Does 3 quotient bits/cycle.
7. In the picture the Dividend register is broken into 2 parts, R and Q:
   - R is loaded directly with the selected remainder from the trial subtractions.
   - Q is a shift register; can do 3-bit left shifts.
   - Except for bit 31, R and Q correspond to the Remainder and Quotient at the end.
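
The 3-bits-per-cycle idea can be sketched as a radix-8 divide. This is a simplified unsigned model: it keeps the Remainder and Quotient in separate variables and leaves out the sign handling (positive Dividend, negative Divisor), the bit 33 alignment, and the shared Dividend register described above. The function name and the `digits` parameter are illustrative.

```python
def divide_radix8(dividend, divisor, digits=22):
    """Unsigned divide producing 3 quotient bits per step; returns (quotient, remainder).

    digits=22 covers a 64-bit dividend (22 x 3 = 66 bits).
    """
    if divisor <= 0 or dividend < 0:
        raise ValueError("sketch handles a positive divisor and non-negative dividend only")
    remainder, quotient = 0, 0
    for i in reversed(range(digits)):
        # Bring in the next 3 dividend bits.
        remainder = (remainder << 3) | ((dividend >> (3 * i)) & 0b111)
        # Trial subtractions of 0..7 times the divisor; the largest that fits wins.
        q = max(k for k in range(8) if k * divisor <= remainder)
        remainder -= q * divisor
        quotient = (quotient << 3) | q               # shift in 3 Partial Quotient bits
    return quotient, remainder
```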

Note: If the Dividend has excess leading zeros, they can be shifted out (by a multiple of three) via the shifter prior to starting the alg, and the number of quotient iterations is then reduced appropriately. The shift amount is basically _________________.

[Figure: doubleword drawn as an 8 x 8 grid, rows Byte 0 to Byte 7, columns Bit 0 to Bit 7.]
Fixed Point - Shifter

POO Requirements
- Shift double word left or right by 0 to 63 bits.
- Sign bit may or may not be included.
- Shift-in value may be zero or the sign bit.

Implementation
- Shift done in two stages. For shift amount $S$:
  1. Rotation done within bytes. This rotation is same for all bytes and $= \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_$.
  2. Bits are shifted in byte multiples (maintaining position within byte).

- Sign bit and shift-in values are details not covered here.
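
One way the two stages could fit together for a logical left shift is sketched below. It is not taken from the design: it assumes the in-byte rotate amount is the shift amount modulo 8 and that the shift-in value is zero, and the function name is made up for the example.

```python
def shift_left_two_stage(value, s):
    """Logical left shift of a 64-bit doubleword by s (0 to 63) in two stages."""
    r, q = s % 8, s // 8            # assumed in-byte rotate amount, whole-byte amount
    src = [(value >> (56 - 8 * i)) & 0xFF for i in range(8)]   # byte 0 is high order

    # Stage 1: rotate every byte left by the same amount r.
    rot = [((b << r) | (b >> (8 - r))) & 0xFF for b in src]

    # Stage 2: move bits in byte multiples, keeping their position within the byte.
    # The high-order (8 - r) positions come from byte B + q; the low-order r
    # positions (which wrapped around in stage 1) come from byte B + q + 1.
    keep = (0xFF << r) & 0xFF
    out = 0
    for B in range(8):
        hi = rot[B + q] if B + q < 8 else 0           # zero shift-in
        lo = rot[B + q + 1] if B + q + 1 < 8 else 0
        out = (out << 8) | (hi & keep) | (lo & ~keep & 0xFF)
    return out
```

Right shifts and sign handling would follow the same pattern with the byte offsets reversed; they are left out here.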
E-unit Control Store

- Each Sub-unit has its own Control Store. Basic structure is the same for all.
  - FX CS is $1024 \times 128$
  - FP and DU CS are each $256 \times 9$

- E-unit μcode is tightly coupled with I-unit μcode.
  - Especially multi-flow algorithms.
  - e.g. E-unit μcode may assume data will be in the OWR without explicitly checking for it.

- The basic control store structure includes:
  - I-unit sends an opcode in the T-cycle which serves as the starting CS address.
    * Opcode can be held in a register in case _________________________.
  - Two banks. Increment through 1 bank, branch to the other. Similar to I-unit.
  - Background scrub machine to access CS when sub-unit is idle, searching for errors.
Floating Point
# Data Formats

## Short

<table>
<thead>
<tr>
<th>Field</th>
<th>S</th>
<th>Characteristic</th>
<th>Fraction</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bits</td>
<td>0</td>
<td>1-7</td>
<td>8-31</td>
</tr>
</tbody>
</table>

## Long

<table>
<thead>
<tr>
<th>Field</th>
<th>S</th>
<th>Characteristic</th>
<th>Fraction</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bits</td>
<td>0</td>
<td>1-7</td>
<td>8-63</td>
</tr>
</tbody>
</table>

## Extended

<table>
<thead>
<tr>
<th>Field</th>
<th>S</th>
<th>High Order Characteristic</th>
<th>Fraction - High Half</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bits</td>
<td>0</td>
<td>1-7</td>
<td>8-63</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Field</th>
<th>S</th>
<th>Low Order Characteristic</th>
<th>Fraction - Low Half</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bits</td>
<td>64</td>
<td>65-71</td>
<td>72-127</td>
</tr>
</tbody>
</table>

## Floating Point Registers

<table>
<thead>
<tr>
<th>FPR #</th>
<th>Bits</th>
<th>Pair</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0-63</td>
<td>Pair with FPR 2</td>
</tr>
<tr>
<td>2</td>
<td>0-63</td>
<td>Pair with FPR 0</td>
</tr>
<tr>
<td>4</td>
<td>0-63</td>
<td>Pair with FPR 6</td>
</tr>
<tr>
<td>6</td>
<td>0-63</td>
<td>Pair with FPR 4</td>
</tr>
</tbody>
</table>
Floating Point Architectural Elements

Data Formats

- First bit is sign bit.
- Fraction is in Hex with Hex point at the left. (Elsewhere, called Mantissa.)
- Characteristic is exponent (base 16) in excess 64 notation. Thus ...

\[
N = \text{Fraction} \times 16^{(\text{Characteristic}-64_{\text{dec}})}
\]

- Format precision varies (but the characteristic is always the same):
  - Short: 6 digits
  - Long: 14 digits  
    This is the "standard" format the FPU is built around.
  - Extended: 28 digits  
    (note: low order characteristic is ignored during processing, but is set to hi order characteristic - 14 when results are stored)
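
The formula can be checked with a small decoder for the short and long formats. The function name is illustrative, and the result is returned as a Python float, so long-format values with more precision than a binary double can hold are rounded.

```python
def decode_hfp(word, bits=32):
    """Decode a short (bits=32) or long (bits=64) hex floating point number."""
    frac_bits = bits - 8                                # 24 or 56 fraction bits
    sign = (word >> (bits - 1)) & 1                     # first bit is the sign
    characteristic = (word >> frac_bits) & 0x7F         # 7 bits, excess 64
    fraction = (word & ((1 << frac_bits) - 1)) / (1 << frac_bits)  # hex point at the left
    value = fraction * 16.0 ** (characteristic - 64)
    return -value if sign else value
```

For example, `0x41100000` has Characteristic 65 and Fraction 1/16, giving 1/16 x 16 = 1.0.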

Floating Point Registers (FPRs)

- 64 bits (= long format)
- Register pairs are used for extended operations.
- Only used in Floating Point instructions.
- The Floating Point Unit has the only copy of these registers.

Instruction types

- Add, Subtract, Multiply, Divide, Load, Store
- Some instructions require results to be normalized (leading digit made non-zero).
- See Chapter 9 of the POO for details.
Floating Point Basic Blocks

- Buffer data comes into OWR

- Two Read Busses select the operand sources.
  - At least one is an FPR (FP ops are RX or RR).

- Separate sections for fraction addition, multiplication, and division.

- One exponent section for all operations
  - On adds/subtracts, determines alignment shift amount and sends to adder.
  - Receives leading zero digit count for normalization (ie. to decrement exponent).
    (Slight lie in picture - Division LZDC goes through multiplier)
  - Sign bit also handled here, but not included in any pictures or discussion.

- Write bus writes the results back to the FPRs.
  - Result Register only sourced from FPRs. Only needed for ____________.
Floating Point Exponent Complex

[Figure: Exponent Complex. Labels: RB1 1:7; Exponent Difference Calculator, Min (15, |RB1-RB2|); Exponent Adder; Max; EXPR; Product Exponent Adder; Sum Exponent Adder; +1; - Sum LZD Count; - Mult. LZD Count; WB 1:7.]

Exponent Complex

Difference Calculator
- Calculates the ABSOLUTE VALUE of the difference between the two exponents and sends to the adder complex.
- The max shift amount for Long Operations is 15. Past that and you're just adding zero.
- An extended shift amount (max is 31) is also calculated and sent out. Not shown.

Exponent Adder
- Calculates the new exponent value and loads into the EXPR.
  - For Multiply/Divide
  - For Add/Subtract
  - Can also pass through RB2 unchanged, and RB1 has a direct path to EXPR.

Sum Exponent Adder
- Decrements exponent by the LZD Count from the adder for normalization.
- Increments on ______.
- Sends resulting characteristic out on the Write Bus.

Product Exponent Adder
- Decrements exponent by the LZD Count from the multiplier for normalization.
- On divides, the quotient is accumulated in the multiplier, so the same LZD can be used.
- ±64 input used to ______.
- -14 input used to ______.
Floating Point Adder

[Figure: Floating Point Adder. Labels: RB1 8:63 and RB2 8:63; Enable; Cmpl; SHIFTER; Alignment Shift Amount (from Exponent Complex); carry out; LZDC; LZD Count (to Exponent Complex); Enable; to WB 8:63.]
Floating Point Adder

- Fraction with smaller exponent is right shifted for alignment.
  - Per the Alignment Shift Amount from the Exponent Complex.

- The other operand may be complemented.
  - Complement if __________ of minus signs, where subtraction counts as 1 minus sign.

- The adder has a latch point in the middle.
  - Sum w/o byte carries is computed and latched.
  - Then the carries are added in.

- If needed a recomplementation is done.

- Shifter normalizes results.
  - Based on the carry out and Leading Zero Digit Count.
  - LZDC is also sent over to the exponent complex.
Floating Point Multiplier

[Figure: Floating Point Multiplier. Labels: RB1 8:63; 2-cycle path; 29 terms; CSA Tree; Carry (112); Sum (112); Quotient (4); Digit Shift; X SAR (5); Extended Shift Amount (from Exponent Complex); LZD Count (to Exponent Complex); 0:111; 0:55; WB 8:63.]
Floating Point Multiplier

Multiplies

- POO requires various combinations. Examples include ...
  - Short x Short = Long
  - Long x Long = Long (truncated)
  - Long x Long = Extended
- Discussion focuses on $L \times L = E$. Other flavors are similar.
- Like Fixed Point multiply, but bigger.
  - RB2 Recoded (modified Booth's alg) to select ___ different ___ bit multiplicand terms.
  - CSA tree adds these to generate Carry and Sum terms, ___ bits each.
  - Multiplier CPA adds Carry and Sum. Propagate done by "brute force" in 3 levels:
    1. Generates lots of common terms (e.g. consecutive strings of propagates).
    2. Calculates, for each pair of digits, the carry-in from the lower to the higher one.
    3. ORs in, for each digit, all carries from lower digits.
- If no Leading Zeros, can send out over WB right away
  - Send the high half immediately.
  - Latch result into BR then shift by 14 to send the low half onto the WB.
- If there are leading zero digits ...
  - Latch result into BR.
  - Shift left based on LZD Count.
  - Send out result (same process as above).

Extended Adds

- Load operands into AR and BR.
- Using the Digit shifter, shift the fraction that has the smaller exponent by the Extended Shift Amount from the Exponent Complex. This fraction is restored into BR, and the other is put into AR.
- Add and post-normalize the same way multiply is done.

Divides

- The divider sends over 4 quotient bits per cycle, which are shifted into the BR.
- When quotient is complete, normalization then proceeds as above.
Floating Point Divider

[Figure: Floating Point Divider. Labels: RB1 8:63; RB2 8:63; DR; DVSR; DVSR * 3; subtractors on DR for 12xDV, 8xDV, and 4xDV; TR; subtractors on TR for 3xDV and 2xDV; Q0:1; Q2:3; Remainder 8:63; CARRIES.]

Floating Point Divider

- **Two stage divide:**
  - Do trial subtractions of 4, 8, and 12 times the Divisor. This determines $Q_{0:1}$.
  - Do trial subtractions of 1, 2, and 3 times the Divisor (subtracting from the Temporary Remainder calculated from the previous stage). This determines $Q_{2:3}$.

- **Send Partial Quotient** ($Q_{0:3}$) **to Multiplier for accumulation.**

- **Load Remainder back into DR.**

- **Using DVSR*3 allows all subtractions to be done via shift and subtract. To load DVSR*3:**
  - First, DR is loaded with $16\times RB2$ and DVSR*3 is loaded with RB2 (RB2 has the Divisor).
  - These are sent through the $-12\times DVSR$ subtractor. Since this subtractor assumes the DVSR has already been tripled, it's designed to do $X-4Y$, which gives $(16-4)\times DVSR = 12\times DVSR$.
  - This is shifted right 2 bits and loaded into DVSR*3 the next cycle.
  - Meanwhile, DR and DVSR are loaded with unaltered RB1 and RB2 values.
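
Both the two-stage digit selection and the DVSR*3 load can be sketched on plain integers. This is a behavioral model only: register timing, carries, and fraction alignment are ignored, and the function names are invented for the example.

```python
def triple_via_subtractor(dvsr):
    """Form 3 x DVSR as described above: (16 x DVSR - 4 x DVSR) shifted right 2."""
    dr = 16 * dvsr                  # DR is loaded with 16 x RB2
    twelve = dr - 4 * dvsr          # the X - 4Y subtractor yields 12 x DVSR
    return twelve >> 2              # shift right 2 bits gives 3 x DVSR


def quotient_digit(remainder, dvsr):
    """One divide step; returns (Q0:3, new remainder).

    Assumes 0 <= remainder < 16 * dvsr.
    """
    # Stage 1: trial subtractions of 4, 8, and 12 times the Divisor give Q0:1.
    q_hi = max(k for k in (0, 4, 8, 12) if k * dvsr <= remainder)
    temp = remainder - q_hi * dvsr                       # Temporary Remainder
    # Stage 2: trial subtractions of 1, 2, and 3 times the Divisor give Q2:3.
    q_lo = max(k for k in (0, 1, 2, 3) if k * dvsr <= temp)
    return q_hi + q_lo, temp - q_lo * dvsr
```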
Decimal
Decimal Data Formats

Zoned Format

\[
\begin{array}{cccccccccccccc}
Z & N & Z & N & Z & N & \cdots & Z & N & Z & N & Z/S & N
\end{array}
\]

Packed Format

\[
\begin{array}{cccccccccccccccc}
D & D & D & D & D & D & D & \cdots & D & D & D & D & D & D & D & S
\end{array}
\]

Z = Zone, N = Numeric, D = Decimal Digit, S = Sign

Each square represents 4 bits
Decimal Data Formats

2 formats, Zoned and Packed. Both are:
- Based on strings of bytes, each containing two 4-bit fields.
- Variable length ➞ only used in SS ops.

Zoned
- First field is called Zone. This can be anything.
- Second field is called Numeric. Often it's a decimal digit.
- In the rightmost byte, the zone may be a sign digit.
- This format is set up for EBCDIC data manipulation. For example, decimal numbers in EBCDIC all have the same zone field value, and the numeric field contains the binary representation of the digit.

Packed
- String of decimal digits, terminated by a sign code.
- Digits must be 0-9. Sign is A-F: B and D mean minus; A, C, E, and F mean plus.
  ➞ This is the format used by all arithmetic decimal ops.
  - Add, Subtract, Multiply, Divide

Some instructions are provided that convert between these two formats:
- Pack: converts from Zoned to Packed. Basically just strips out the zones.
- Unpack: converts from Packed to Zoned. Inserts F into Zone, making it EBCDIC.
- Edit, Edit & Mark: Very hairy. Converts from Packed to Zoned and allows lots of modifications on the way. Masochists are referred to Chapter 8 of the POO.
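
The Pack and Unpack conversions can be sketched on byte strings. This ignores the length fields, truncation, and padding rules of the real instructions (see the POO); the function names and the whole-string interface are assumptions made for illustration.

```python
def pack(zoned: bytes) -> bytes:
    """Zoned to Packed: strip the zones, move the sign to the right end."""
    digits = [b & 0x0F for b in zoned]            # numeric fields
    sign = zoned[-1] >> 4                         # zone of the rightmost byte is the sign
    nibbles = digits + [sign]
    if len(nibbles) % 2:
        nibbles.insert(0, 0)                      # pad with a leading zero digit
    return bytes((nibbles[i] << 4) | nibbles[i + 1] for i in range(0, len(nibbles), 2))


def unpack(packed: bytes) -> bytes:
    """Packed to Zoned: insert F zones, put the sign in the rightmost zone."""
    nibbles = [n for b in packed for n in (b >> 4, b & 0x0F)]
    *digits, sign = nibbles
    zoned = [0xF0 | d for d in digits[:-1]] + [(sign << 4) | digits[-1]]
    return bytes(zoned)
```

For example, zoned `F1 F2 C3` (+123) packs to `12 3C`, and unpacking `12 3C` gives `F1 F2 C3` back.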
Decimal Unit

• Basic elements include:
  - The OWR (note that it's only fed from the buffer, since all OPs are SS).
  - Scratch Registers 1 and 3. SR3 is loaded with 3xSR1.
  - Digit Multipliers producing multiples of SR1 from 2 to 9.
  - Adder Register.
  - Result Register.
  - Adder/subtractor - operates on the RR and the Adder Register.
  - Multiplier/Quotient Register.
  - NOTE: data paths are all 8 bytes wide.

• Addition
  - Load the AR and RR with the operands. Then add into the RR.

• Multiplication
  - Per POO: Multiply Decimal allows a Multiplier of ≤8 bytes and a Multiplicand of ≤16 bytes. Multiplicand must have enough leading zeros to ensure the result is ≤16 bytes.
  - Load the Multiplier into the MQR and the Multiplicand into SR1,3. Clear the RR, which will act as an accumulator.
  - Use MQR56:59 (60:63 is the sign) to select the proper multiple of the Multiplicand.
  - Add this to the contents of the RR and store the result back into the RR, shifting right as you do. The rightmost digit is shifted into the MQR and its rightmost digit is discarded.
  - Continue until done. The product is in the RR and MQR and can be read out over 2 cycles, if needed. A shifter (not shown) aligns the data before doing so.

• Division
  - From POO: DVSR ≤ 8 bytes, Dividend ≤ 16 bytes.
  - Load Divisor into SR1,3 and also put the upper 5 bits into DVSR. Load Dividend into RR.
  - Upper 8 bits of Dividend/Remainder and upper 5 bits of Divisor address a lookup table for a lower bound guess at the quotient digit. (Assumes remaining Divisor bits are ____ and Dividend bits are ____.)
  - Based on this guess, select a Divisor multiple and subtract from the Remainder.
  - Keep incrementing QDR and subtracting the Divisor times 1 until you get a carryout (ie. number goes negative).
  - At this point, don't load the RR, it has the correct Remainder. QDR2 has the correct Quotient, which is shifted into MQR.
  - Continue till done. MQR has the Quotient, RR has the Remainder.
Basic S-unit Concept


Select highest priority request
Do the work
Post results

SC request
I-unit request
Internal & recycled requests
Ports

Opcode Address Flags

Request:

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Logical</th>
<th>(Physical)</th>
<th>(Misc)</th>
<th>Key</th>
<th>DIM ID</th>
<th>V/R</th>
<th>Misc</th>
</tr>
</thead>
</table>
Basic S-unit Concept

- Pipe has 3 basic stages:
  - Select the highest priority request, including internal and external requests. Split across P and A cycles.
  - Do the work for that request. Split across T and B cycles.
  - Post the results. Done in B and R-cycles.

- 580 had just PBR. Apache and Sona added A and T for timing reasons.

- A request includes everything needed to complete processing, including:
  
  Opcode (~150 of 256 used):
  
  I-unit: Fetches, stores, branch, SC ops (e.g. XSU stuff), TLB maintenance, register loading, misc.
  
  SC: Move-in flows, move-out flows, key ops, misc.
  
  Internal: TAG maintenance, TLB maintenance, translator flows, misc.

- Address:
  
  Logical (called Effective in I-unit)
  
  Physical (not supplied on I-unit requests)
  
  Misc (STD)

- Flags:
  
  Key
  
  Address Dimension
  
  Virt/Real
  
  Others

- Ports store the request for recycling.
S-unit Pipelines

- **S-unit has two parallel pipelines, IF and OP.**
  - Each has its own TLB, TAGs, and Buffer.
  - OP sends data to the _____, IF sends data to the _____.
  - Pipelines are free-running; incomplete requests recycle until complete.

- **Two common OP pipe requests from the I-unit are Fetches and Stores.**
  - Much of the S-unit is tailored to these operations.
  - Fetches read data out of the buffer and into the OWR.
  - Stores have two parts.
    * The *store* flow reads the data from the buffer and into the OWR.
    * The *write* flow writes data (sent from the RR) into the buffer.
    * The store flow of a store is handled a lot like a normal fetch.
S-unit OP Pipe Basic Blocks

- **TAGs, TLB**
  - used to determine if line is present in the cache.

- **Buffer (a.k.a. cache)**

- **Translator**
  - does Virtual to Physical Address translations.

- **Fetch Ports**
  - contain fetch requests until they complete.

- **Store Ports**
  - contain write flows (of stores) until they complete.

- **Search Machine**
  - does background TLB maintenance.

- **Scrub Machine**
  - does background searches for single-bit errors.

- **SC Requests**
  - path used by SC to move data into and out of the buffer.

- **I-unit Request Processing sequence:**
  1. Requests priority in the A-cycle.
  2. If granted, TAG and TLB match done in the T. Buffer accessed in B.
  3. If line present, status valid posted and data clocked into the OWR.
     * Status Valid is a key signal from the S-unit. Indicates request completion, even for requests that don't return data. Lack of Status Valid leads to ______ in the I-unit.
  4. Otherwise, request is loaded into a fetch port.
  5. Later, the fetch port requests priority into the P-cycle.
  6. If granted, it contends for A-cycle priority and continues as from 1.
Address Translation

[Figure: Address Translation. Labels: Segment Table Designation and Virtual Address; Segment Table and Page Table in Memory; Real Address; Prefixing (swap page 0 and prefix page); Absolute Address; Address Dimension and MRU Table; Physical Address.]

Address Translation

Dynamic Address Translation
- Maps Virtual Address to Real Address on 4K boundaries.
- IBM defined. Enabled by a _____ bit.
- Uses 2-level lookup of tables stored in memory.
  - Segment Table Origin 0:19 (left justified) points to beginning of segment table.
  - VA1:11 (for _____ segments) indexes into segment table in 4 byte increments.
  - Segment table entries are 1 word. Bits 1:25 form a Page Table Origin which (left justified) points to a page table.
  - VA12:19 indexes into the page table in 4 byte increments.
  - Page table entries are 1 word. Bits 1:19 form the Page Frame Real Address (i.e. bits 1:19 of the Real Address).
  - The PFRA is then used in place of the high order 19 bits of the address.
  - VA20:31 = RA20:31. No translation done on these bits.

Prefixing
- Maps a Real Address to an Absolute Address.
- Also IBM defined.
- Swaps the prefix page (pointed to by a prefix register) with page 0. All other page addresses are unchanged.
- Allows each CPU's page 0 to point to a different address in memory.
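
Prefixing amounts to a swap of two page addresses, as in this sketch. It assumes 4K pages and a prefix value that is already page aligned; the function name is illustrative.

```python
def apply_prefix(real_addr, prefix):
    """Map a Real Address to an Absolute Address by swapping page 0 and the prefix page."""
    page = real_addr & ~0xFFF          # page address
    offset = real_addr & 0xFFF         # byte within the page
    if page == 0:
        return prefix | offset         # page 0 goes to the prefix page
    if page == prefix:
        return offset                  # the prefix page goes to page 0
    return real_addr                   # all other pages are unchanged
```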

Main Store Reconfigurable Unit Table Lookup
- Maps an Absolute Address to a Physical Address.
- Amdahl defined. Implemented in dedicated RAM.
- AA1:9, along with the Address Dimension, index into a table which provides PA0:9. AA10:31 are unaltered.
- Allows each Domain (each Domain is in a different Dimension) to map to its own chunk of memory, and gives an Addressing Exception if a Domain tries to go outside its bounds.
- Also used to reconfigure Main Store in 4M chunks.
- Called MRU Table or MRUT.
MAIN MEMORY

<table>
<thead>
<tr>
<th>STD</th>
<th>VA</th>
</tr>
</thead>
<tbody>
<tr>
<td>00003001</td>
<td>00903A5F</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Address</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>3000</td>
<td>34420A4A</td>
</tr>
<tr>
<td>3004</td>
<td>4820BC8A</td>
</tr>
<tr>
<td>3008</td>
<td>BB836B54</td>
</tr>
<tr>
<td>300C</td>
<td>34420A4A</td>
</tr>
<tr>
<td>3010</td>
<td>9BA6473B</td>
</tr>
<tr>
<td>3014</td>
<td>C8574387</td>
</tr>
<tr>
<td>3018</td>
<td>578AECEF</td>
</tr>
<tr>
<td>301C</td>
<td>34420A4A</td>
</tr>
<tr>
<td>3020</td>
<td>00003D41</td>
</tr>
<tr>
<td>3024</td>
<td>00003054</td>
</tr>
<tr>
<td>3028</td>
<td>FF738A63</td>
</tr>
<tr>
<td>302C</td>
<td>0000349A</td>
</tr>
<tr>
<td>3030</td>
<td>A7D8F9EE</td>
</tr>
<tr>
<td>3034</td>
<td>00003442</td>
</tr>
<tr>
<td>3038</td>
<td>34420A4A</td>
</tr>
<tr>
<td>303C</td>
<td>56734E2A</td>
</tr>
<tr>
<td>3040</td>
<td>4820BC8A</td>
</tr>
<tr>
<td>3044</td>
<td>FFF00123</td>
</tr>
<tr>
<td>3048</td>
<td>47584FDA</td>
</tr>
<tr>
<td>304C</td>
<td>00489AE7</td>
</tr>
<tr>
<td>3050</td>
<td>15283754</td>
</tr>
<tr>
<td>3054</td>
<td>ADFFD456</td>
</tr>
<tr>
<td>3058</td>
<td>47386954</td>
</tr>
<tr>
<td>305C</td>
<td>98473859</td>
</tr>
<tr>
<td>3060</td>
<td>A6734251</td>
</tr>
<tr>
<td>3064</td>
<td>39CD70F3</td>
</tr>
<tr>
<td>3068</td>
<td>A8DF3490</td>
</tr>
<tr>
<td>306C</td>
<td>32859604</td>
</tr>
</tbody>
</table>

*note: all numbers are in hex
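
The Dynamic Address Translation lookup can be walked through in code using the STD, VA, and memory words from the example above. This sketch applies only the field positions given on the Address Translation slide; it does not check validity bits or table lengths, and it stops at the Real Address (no Prefixing or MRUT). `MEMORY` holds just the two words the walk touches, copied from the table.

```python
# Words copied from the MAIN MEMORY table (addresses and values in hex).
MEMORY = {
    0x3024: 0x00003054,    # segment table entry reached by the walk
    0x304C: 0x00489AE7,    # page table entry reached by the walk
}


def translate(std, va, memory):
    """Two-level table lookup: Virtual Address to Real Address."""
    sto = std & 0xFFFFF000             # STD bits 0:19, left justified
    sx = (va >> 20) & 0x7FF            # VA1:11, segment index
    ste = memory[sto + 4 * sx]         # segment table entry, 4 byte increments
    pto = ste & 0x7FFFFFC0             # STE bits 1:25, left justified
    px = (va >> 12) & 0xFF             # VA12:19, page index
    pte = memory[pto + 4 * px]         # page table entry, 4 byte increments
    pfra = pte & 0x7FFFF000            # PTE bits 1:19
    return pfra | (va & 0xFFF)         # VA20:31 = RA20:31, not translated
```

With STD 00003001 and VA 00903A5F, the walk reads the segment table entry at 3024 and the page table entry at 304C, and `translate` returns 00489A5F.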
Translation Lookaside Buffer

- Holds recently used translations.

- 256 sets x 2 associativities.

- Addressed by ________.

- Match against EA1:12, STO, Address Dimension, various flags.

- TLB "data" is ________________.

- Real to Physical address translations are also stored in the TLB.
Buffer Organization

• 128 sets x 8 associativities

• 128-byte lines.

• TAGs contain ________________.

• Addressed by ________________.
**OP TAGs/ TLB**

- TAGs/TLB used to determine line state. Possible states are:

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
<td>Yes</td>
</tr>
<tr>
<td>Yes</td>
<td>No</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>No</td>
<td>Yes</td>
<td>No</td>
<td>NA</td>
</tr>
<tr>
<td>No</td>
<td>No</td>
<td>No</td>
<td>NA</td>
</tr>
</tbody>
</table>
IF TLB

- IF TLB is a copy of the PA portion of the OP TLB.
  - When creating TLB entries, Translator writes to both OP and IF TLB.

- Stream register holds TLB address for current page.
  - Current page = page containing current instruction address.
  - Loaded/validated by accessing OP TLB (via OP pipe) on branches and when instruction stream crosses 4K page boundaries (a.k.a. IF TLB Validate flow).
  - This validation flow accesses the OP TLB to do a match to make sure a valid translation is in the TLB.
  - If the validation flow gets a match, the matching location (TLB address and associativity) is saved in a stream register.
  - All subsequent IF flows for this stream use this "pointer" in the stream register to just read the PA out of the TLB and use it for TAG match; TLB match is implicit.
  - Note that this means reading out the same entry from the IF TLB over and over until you branch or cross into a new page.

- Stream registers allow for late branch decisions.

Stream Register Timing (figure; cycle alignment not recoverable):
- IF TLB Vldte (OP, IF): A T B R
  - OP TLB Match
  - Stream reg loaded
- Seq IF - IF Pipe: A T B R
  - Access TLB
  - TAG Match

AMDAHL INTERNAL USE ONLY
IF TLB - Another Perspective

• Key Points:
  - TLB entry includes 2 copies of the PA field.
  - One copy of the PA is accessible from IF, the other from OP.
  - Each pipe uses the PA for its TAG match.
  - To make a new TLB entry the translator has to go down IF (as well as OP) to write the IF copy of the PA.
  - The OP pipe does an explicit TLB match each cycle, whereas the IF pipe does one only on an IF TLB Validate flow, then remembers the result.
Translator Data Paths

The basic translation algorithm is:

1. Load VIRT on a TLB miss with the address that's in the pipe.

2. At the same time, load TR STD with the STO associated with the TLB miss.

3. Add the Segment Index from VIRT to the STO to generate a segment table entry address in the TOR.

4. Send the TOR down the pipe (when granted priority) to fetch the segment table entry.

5. Load the STE (i.e. the Page Table Origin) into TR FDB.

6. Add the Page Index from VIRT to the PTO, and send this address down the pipe to fetch the Page Table Entry.

7. Load the PTE (i.e. the Page Frame Real Address) into the TR FDB. Send it through prefixing and MRUT to get a physical address.

8. Take one last flow down the pipe and use the VA, STO, and PA to make a new TLB entry.

- Note: if table entry fetches (which use Real Addresses) get a TLB miss themselves, load their address into REAL, do Prefixing and MRUT, and make a TLB entry. Then continue with the original translation.
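The eight steps above can be sketched in Python as a two-level table walk. This is a minimal model, not the hardware algorithm: the field widths (8-bit Page Index, 12-bit byte index), the 4-byte table entries, the dictionary used as storage, and the `real_to_physical` stand-in for prefixing and MRUT are all assumptions made for illustration.

```python
PAGE_BITS = 12     # 4K pages
PX_BITS = 8        # assumed: 256 pages per 1 MB segment
ENTRY_BYTES = 4    # assumed table entry size
PAGE_MASK = (1 << PAGE_BITS) - 1

def split_virtual(va):
    """Split a virtual address into segment index, page index, byte index."""
    bx = va & PAGE_MASK
    px = (va >> PAGE_BITS) & ((1 << PX_BITS) - 1)
    sx = va >> (PAGE_BITS + PX_BITS)
    return sx, px, bx

def translate(va, sto, storage, tlb, real_to_physical=lambda ra: ra):
    """Return the physical address for va, walking the tables on a TLB miss."""
    page_va = va >> PAGE_BITS
    if (page_va, sto) in tlb:                     # TLB match: no walk needed
        return tlb[(page_va, sto)] | (va & PAGE_MASK)
    sx, px, bx = split_virtual(va)                # steps 1-2: capture VA and STO
    pto = storage[sto + sx * ENTRY_BYTES]         # steps 3-5: fetch STE, giving the PTO
    frame_real = storage[pto + px * ENTRY_BYTES]  # steps 6-7: fetch PTE, giving the frame
    frame_phys = real_to_physical(frame_real)     # stand-in for prefixing and MRUT
    tlb[(page_va, sto)] = frame_phys              # step 8: make the new TLB entry
    return frame_phys | bx

# Example: segment 2, page 3, byte 0x123, with made-up table contents.
storage = {0x1000 + 2 * 4: 0x2000, 0x2000 + 3 * 4: 0x7F000}
tlb = {}
va = (2 << 20) | (3 << 12) | 0x123
pa = translate(va, 0x1000, storage, tlb)
```

A second call with the same address finds the entry in `tlb` and skips both table fetches.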
from R-cycle to P-cycle

- P CYCLE GO
- Kick signals
- P Prio Request
- R WAIT STATE
- R Load Port
- R NEXT OPCODE, FLAGS
- Port State Machine
- Opcode, Flags
- ROAR: Effective Address
- Physical Address
- STO

General Port Structure

• This is a general picture of a port. Actual ports may be a subset of this.

• When a Port is loaded, Addresses (EA, STD, PA) are clocked into registers.

• Opcode and flags are also clocked in, but they may differ from the original versions, depending on the results of the flow just completed.

• A state machine keeps track of the port state, including some external events, and may even modify the opcode if required by these external events.

• Wait state is loaded to indicate what the algorithm is waiting on (e.g. on a TLB miss, a Fetch request would go into a Translator Wait state while the translation is being done). This controls a selector that monitors the various possible "kick" signals.

• Once kicked out of the wait, the port requests priority to the pipe.
Fetch Ports

• Just two states, busy and available.

• Port goes busy (is "allocated") when I-unit request gets priority into A. If the external flow completes, the port won't actually be needed.

• An independent mechanism keeps track of the order of port allocation. The oldest request is called Top Of Queue. The TOQ request is the only one that's allowed to post Status Valid (i.e. send results) to the I-unit.

• Need ________ fetch ports for no-wait service.
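The allocation and Top Of Queue rules above can be modeled roughly as follows. This is a hypothetical sketch: the class and method names are invented, the port count is a parameter rather than the real number, and freeing the port when Status Valid is posted is an assumption.

```python
from collections import deque

class FetchPorts:
    """Ports are either busy or available; TOQ is the oldest busy port."""

    def __init__(self, count):
        self.available = list(range(count))
        self.order = deque()   # allocation order, oldest first

    def allocate(self):
        """Allocate a port when an I-unit request gets priority; None if all busy."""
        if not self.available:
            return None
        port = self.available.pop(0)
        self.order.append(port)
        return port

    def top_of_queue(self):
        return self.order[0] if self.order else None

    def post_status_valid(self, port):
        """Only the TOQ port may post Status Valid; doing so frees the port."""
        if port != self.top_of_queue():
            return False
        self.order.popleft()
        self.available.append(port)
        return True
```

A younger request that finishes early simply waits until the older ones ahead of it have posted their results.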
Fetch (TLB and TAG miss) - Simplified

I-unit flow (RX)  D A T B X X X X X X X X X X X X X X X X X W
S-unit Flow  A T B R
TLB Miss  |-
TR Busy  |------/ /------------------|
TR Kick  |-
TR Wait State  |------/ /-----|
Re-cycle Flow  P A T B R
Line Miss  |-
Move In Processing  |------/ /---------------------|
Move In Kick  |-
MI Wait State  |------/ /-----|
Re-cycle Flow  P A T B R
Status Valid  |-
Port State  -AV|------------------SVP-----------------------------------------------------|---AV-->

Multiple fetches - 1st one has line missing

I-unit flow 0  D A T B X X X X X X X X X X X W
S-unit flows  A T B R P A T B R
I-unit flow 1  D A T B B B B B B B B B X W
I-unit flow 2  D A T T T T T T T T T B X W
I-unit flow 3  D A A A A A A A A A A A A A A A A T B X W
Port 0 State  -AV|------------------SVP FL0---|A|---SVP FL3-->
Buffer Data Paths - partial

STQ 0:7 -> BUFFER DATA IN

RR FX, RR FP, RR DEC

OP (IF) Buffer

64 x 8

Rotate

OAR28:31 (IAR27:30)
OAR26:28 (IAR26:27)

8 (16) -> to OWR (IDR)
Buffer Data Paths (incomplete)

Output paths
- 8 associativities of data.
- 64 bytes read out (bit 25 used in addressing buffer, though not TAGs).
- Low order address bits used to select the correct data, then align it to send into the OWR.
- Note IF differences due to 16 byte output path and halfword alignment.

Input Paths
- RR data clocked into Store Queue on Data Ready.
- Write flow eventually writes data from Store Queue into buffer.
Store Port State Machine

- **Tracks status of write portion of stores**
  - Allocated on SV of store flow. Responsibility passed from the fetch port to the store port.

- **Basic Store Algorithm**
  - Waits for Data Ready, then starts requesting priority.
  - Once write flow gets priority (both P and A), it's done.

<table>
<thead>
<tr>
<th>Normal Store Sequence (SV on Ext flow)</th>
</tr>
</thead>
<tbody>
<tr>
<td>I-unit flow</td>
</tr>
<tr>
<td>S-unit Store</td>
</tr>
<tr>
<td>Status valid</td>
</tr>
<tr>
<td>Alloc. Store Port</td>
</tr>
<tr>
<td>Data Ready</td>
</tr>
<tr>
<td>Write Flow</td>
</tr>
</tbody>
</table>

- **Line Status State Machine**
  - Tracks presence of line in cache.
  - During MO's, address of line moved out is compared with addresses of pending stores. MO interference called on a match. Machine goes to Line Missing state.
  - On Line Missing, Store Retry flow initiated (a separate mechanism tracks priority grants).
  - Store Retry matches the PA from the Store Port with the PA in the TAGs to see if the line is in the cache. If not, an MI is requested.
  - When the SR flow finally gets a match, the LS machine goes back to Line Present State.
  - **Store-ahead**: if the fetch data isn't needed (e.g. on a Store), you only need TLB match to post SV. If TAG Miss, allocate Store Port in Line Missing State.

<table>
<thead>
<tr>
<th>Store Retry Sequence</th>
</tr>
</thead>
<tbody>
<tr>
<td>Port State</td>
</tr>
<tr>
<td>Store Retry</td>
</tr>
<tr>
<td>TAG Miss</td>
</tr>
<tr>
<td>Resultant MI</td>
</tr>
<tr>
<td>Store Retry</td>
</tr>
<tr>
<td>TAG Match</td>
</tr>
<tr>
<td>Write Flow</td>
</tr>
</tbody>
</table>

Set Change Bit

• Each 4K page has a Storage Key. Includes ...
  * 4 bit Access Key (often just called the Key):
    - Matched against a 4 bit key in the PSW.
    - Mismatches on stores cause a protection exception.
  * Fetch Protect bit:
    - If this is a one, protection checking applies to fetches also.
  * Reference bit:
    - Set when a storage reference is made to the page.
    - Used ____________________________
  * Change bit:
    - Set when the page is modified.
    - Used ____________________________

• System storage includes a key array associated with the MSU.

• A copy of the key is kept in the TLB for protection checking.
  - Also tracks modified state of page (i.e. Change Bit).

TLB Contents

<table>
<thead>
<tr>
<th>EA1:12</th>
<th>STO</th>
<th>DIM ID</th>
<th>P/P</th>
<th>Misc</th>
</tr>
</thead>
<tbody>
<tr>
<td>PA0:19</td>
<td>KEY0:3, C, FP</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

• On Stores to a page with C=0, need to update TLB and send Set Change Bit message to SC.

• SCB state added to state machine to track pending SCB. Priority handled separately.
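The key check and the Set Change Bit rule can be sketched as below. The names are hypothetical and the message string is a placeholder. The rule that a PSW key of zero matches any storage key comes from the 370 architecture rather than from this page.

```python
from dataclasses import dataclass

@dataclass
class TlbKey:
    key: int             # 4-bit access key copied from the storage key
    fetch_protect: bool  # FP bit
    change: bool         # TLB copy of the Change bit

def access_allowed(psw_key, entry, is_store):
    """Return True if the access passes key-controlled protection."""
    if psw_key == 0 or psw_key == entry.key:   # PSW key 0 matches any storage key
        return True
    if is_store:
        return False                           # mismatch on a store: protection exception
    return not entry.fetch_protect             # fetches are checked only when FP = 1

def store(psw_key, entry, send_to_sc):
    """Check protection, then raise the Change bit on the first store to the page."""
    if not access_allowed(psw_key, entry, is_store=True):
        raise PermissionError("protection exception")
    if not entry.change:
        entry.change = True                    # update the TLB copy
        send_to_sc("SET CHANGE BIT")           # hypothetical message name
```

Only the first store to a page with C=0 generates a message; later stores see C=1 in the TLB and send nothing.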
Simplified Search Machine

Init value

+1

Read Flow0

Read Flow1

Read Flow2

Read Flow3

Flow0 Match

Write Flow0

Match Results

Flow0:3 x Assoc.

Read Flow0 P A T B R
Read Flow1 P A T B R
Read Flow2 P A T B R
Read Flow3 P A T B R
Flow0 Match
Write Flow0 P A T B R
Search Machine

Architectural requirements

Purge TLB - invalidate all virtual entries in the TLB for the current domain.
Invalidate Page Table Entry - given PTO and PX, set Page Table and TLB entries invalid.
Set Storage Key - store new key value into Key Array at given Real Address.

PTLB Algorithm

- A Pre/Post latch is written into TLB entries when they're created.
- This same P/P bit is included in TLB match (i.e. match the P/P bit in the TLB with the P/P latch). Normally all entries will match the latch.
- PTLB toggles the latch and posts SV. All entries will now mismatch.
- The TLB is searched in the background for entries to invalidate, based on matching:

Background search implementation:

- Read flow reads TLB contents and matches them against appropriate search parameters (both associativities).
- Write flow writes entries to appropriate new state. Two write flows per read flow.
- Match results for 4 flows accumulated to help deal with pipe latency.
- The current R and W addresses are kept in separate registers.
- Note: Abandon TLS does same thing, but forces match on all searches.
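A minimal sketch of the Pre/Post idea follows, assuming a dictionary in place of the real TLB arrays and ignoring the distinction between virtual and real entries. The class and method names are invented for illustration.

```python
class PurgeableTlb:
    """Hypothetical model of the Pre/Post (P/P) purge scheme."""

    def __init__(self):
        self.pp_latch = 0
        self.entries = {}   # tag -> (pa, pp_bit)

    def insert(self, tag, pa):
        self.entries[tag] = (pa, self.pp_latch)   # P/P latch copied at creation

    def lookup(self, tag):
        hit = self.entries.get(tag)
        if hit is None or hit[1] != self.pp_latch:   # P/P bit is part of the match
            return None
        return hit[0]

    def purge(self):
        """PTLB: toggle the latch so every existing entry mismatches at once."""
        self.pp_latch ^= 1

    def background_search(self):
        """Invalidate stale entries so they cannot match after a later toggle."""
        stale = [t for t, (_, pp) in self.entries.items() if pp != self.pp_latch]
        for tag in stale:
            del self.entries[tag]
        return len(stale)
```

The sketch shows why the purge can post SV immediately: correctness comes from the latch toggle, and the background search only has to finish before the next toggle.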

IPTE Algorithm

- Search parameter is the PA (called Search Physical Address Match).
- SPAM inhibits status valid on fetch flows (a match indicates the fetch wants the TLB entry that has a pending IPTE).
- ______ read flows for 1MB segments.

SSK Algorithm

- Same as IPTE: search for PA; SPAM match on fetch flows causes the SSK key to be used for protection check in place of the TLB key.
- Have to search \( z \leq b \).

Scrub Machine (not detailed)

- Does background fetches looking for buffer single bits to clean up.
- Looks a lot like the Search Machine.
Cycle Accessed

T T T T R

TAG Entry

<table>
<thead>
<tr>
<th>PA0:19</th>
<th>Valid</th>
<th>Private</th>
<th>Modified</th>
<th>IF Pair, IF Pair Assoc. 0:2</th>
</tr>
</thead>
</table>

1 per associativity

LRU Data (R-cycle)

<table>
<thead>
<tr>
<th>0&gt;1</th>
<th>0&gt;2</th>
<th>0&gt;3</th>
<th>0&gt;4</th>
<th>0&gt;5</th>
<th>0&gt;6</th>
<th>0&gt;7</th>
<th>1&gt;2</th>
<th>1&gt;3</th>
<th>...</th>
<th>4&gt;5</th>
<th>4&gt;6</th>
<th>4&gt;7</th>
<th>5&gt;6</th>
<th>5&gt;7</th>
<th>6&gt;7</th>
</tr>
</thead>
</table>

1 per set.
TAG Contents

- **Valid bit** - TAG entry is valid.

- **PA0:19**
  - page physical address.
  - matched against PA0:19 in TLB.

- **Private bit**
  - If '1', this is the only cached copy of the line (line is Private).
  - This CPU is allowed to modify the line.
  - If '0', this line is read only (line is Public).
  - System Controller is responsible for setting this bit correctly when moving the line in.
  - OP Cache is usually about 90% Private.
  - IF Cache is almost entirely Public (see IF Pair below for exception).

- **Modified bit**
  - If '1', the line has been modified since being moved into the cache.
  - If Modified, the line state must also be ___________.
  - If Modified, need to back store to MSU eventually.
  - About 50% of Private lines get Modified.

- **IF Pair**
  - Means line is private in OP and IF has a copy at the same line address.
  - The Write flow of a Store will write both OP and IF copies.
  - Used when a line contains both operands and instructions. Prevents thrashing.
  - The Line Pair state is created by the SC when the line is moved in.
  - IF Pair Assoc. points to the associativity of the other half of the IF pair.

- **LRU data**
  - 1 bit for each pair of associativities in the set (covering all combinations).
  - indicates which associativity of the pair has been accessed more recently.
  - used to determine which assoc. is Least Recently Used (for replacement on Move-Ins).
  - one entry per set.
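The pairwise LRU scheme can be sketched as below for 8 associativities, which gives 28 bits per set. Reading bit "a>b" as "a was accessed more recently than b" is an assumption about the encoding, and the function names are illustrative.

```python
from itertools import combinations

WAYS = 8
PAIRS = list(combinations(range(WAYS), 2))   # (0,1), (0,2), ... (6,7)

def touch(lru, way):
    """Record that `way` is now more recently used than every other way."""
    for a, b in PAIRS:
        if a == way:
            lru[(a, b)] = True    # bit "a>b": a accessed more recently than b
        elif b == way:
            lru[(a, b)] = False

def least_recently_used(lru):
    """Return the way that loses every pairwise comparison."""
    for way in range(WAYS):
        older_than_all = all(
            (not lru[(a, b)]) if a == way else lru[(a, b)]
            for a, b in PAIRS if way in (a, b)
        )
        if older_than_all:
            return way
    return None

# Example: start from an arbitrary state, then access every way once.
lru = {pair: False for pair in PAIRS}
for w in [3, 0, 7, 1, 2, 4, 5, 6]:
    touch(lru, w)
victim = least_recently_used(lru)
```

Each access updates only the 7 bits that involve the accessed associativity, which is what makes the scheme cheap to maintain.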
Buffer Data Paths

Rev. 1, 5/91

RR FX
RR FP
RR DEC

STQ 0:7

Data Ready

DIMIR

MI DATA

BUFFER DATA IN

OP (IF) Buffer

MOVEOUT REGISTER

MO DATA

(OP only)

Path widths in bytes
(IF \( \Delta \) in parentheses)
Buffer Data Paths

16 byte (1 QW) MI path from System Storage
• 64 bytes (4 QWs) accumulated for buffer data-in.
• Bypass path from Data In Register.
  - Requested doubleword will be in first QW returned.
  - Can be bypassed to OWR while subsequent QWs are being accumulated.

16 byte MO path to System Storage
• 64 byte MO register latches data from selected associativity.
• Muxed out to System Storage over 4 cycles.
### Line Miss Flow

- **IU flow:** DATBXXXXX ● ● ● ● X X X X X X X X X W
- **SU flow:** ATBR
- **Line Miss, TLB Match:** |
- **Request to SC:** |
  - (Opcd, PA0:27, LA18:19)
- **Replacement Info:** |
  - (Assoc. Num0:2, Line state)

### Move Out Sequence

- **LMO1 Flow:** PATBR
- **LMO2 Flow:** PATBR
- **Write TAG Invalid:** |
- **MO REG:**
  - (--HL0|--HL1--)
- **SEND QW:**
  - [0|1|2|3|4|5|6|7]

### Move In Sequence

- **Load BYPass TAG ADDress:** PATBR
- **Kick Fetch Port:** |
- **MI1 FLOW:** PATBR
- **MI2 FLOW:** PATBR
- **DATA IN REG:**
  - QW0 - uncorrected
  - QW0 - corrected
  - QW1
  - QW2
  - QW3
  - QW4
  - QW5
  - QW6
  - QW7
S-unit - Fetch w/Line miss

1. External flow
   - Gets TLB match, giving us the PA.
   - Gets TAG miss - the line needs to be moved in.

2. Send Move In request to SC
   - In R-cycle, send:
     - opcode (e.g. Fetch Private).
     - PA0:27 (low order bits indicate which QW to move in first).
     - EA18:19 (_______________________________________).
   - In R+1 cycle, send:
     - Assoc# and line state of line to replace (swap).

3. Move Out
   - Initiated by SC an indeterminate number of cycles later.
   - Has three basic flavors, based on swap line state:
     - Short: line is public. Takes one flow to change it to invalid.
     - Private Short: line is private but unmodified. Takes 2 flows.
       * First flow verifies it's still unmodified (if not, convert to LMO). Second flow invalidates.
     - Long (what's shown in the diagram): line is modified and needs to be backstored.
       * Two flows needed to read out both half lines.
       * Muxing to System Storage takes 4 cycles, so the flows are spaced 4 cycles apart.
   - This is called a Swap MO.

4. Move In
   - Initiated by the SC an indeterminate number of cycles later.
   - Data transferred 1 QW per cycle, starting with the QW containing the requested data.
   - LDBYPTAGAD flow
     - Loads BYPTAGAD register with the address of the data being moved in.
     - Generates a P-cycle Kick signal to the ports.
   - Awakened by the Kick signal, the fetch port retries the fetch the next cycle.
     - BYPTAGAD is matched against the TLB, along with the TAGs.
     - If it matches (which it will in this case), the data is selected from the bypass path.
     - To save a cycle, data can be bypassed before going through ECC.
   - MI1 and MI2 flows
     - After 4 QWs fill the data in register, MI1 flow writes them.
     - Similarly, MI2 flow writes the second 4 QWs.
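The three Swap MO flavors in step 3 amount to a lookup on the state of the line being replaced. A small sketch, with hypothetical state names and return values:

```python
# Swap line state -> (Move Out flavor, number of S-unit flows)
MOVE_OUT_FLAVORS = {
    "public": ("Short", 1),            # one flow changes the line to invalid
    "private": ("Private Short", 2),   # verify still unmodified, then invalidate
    "modified": ("Long", 2),           # one flow per half line, data backstored
}

def swap_move_out(line_state):
    """Return the flavor and flow count for the line being swapped out."""
    return MOVE_OUT_FLAVORS[line_state]

def private_short_first_flow(still_unmodified):
    """First Private Short flow re-checks the line; a modified line becomes an LMO."""
    return "invalidate" if still_unmodified else "convert to LMO"
```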
## Priority

**Overall S-unit priority structure:**

<table>
<thead>
<tr>
<th>Priority</th>
<th>Requestor</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.</td>
<td>SC</td>
</tr>
<tr>
<td>2.</td>
<td>Store Dir.</td>
</tr>
<tr>
<td>3.</td>
<td>Translator</td>
</tr>
<tr>
<td>4.</td>
<td>Fetch Dir.</td>
</tr>
<tr>
<td>5.</td>
<td>I-UNIT</td>
</tr>
<tr>
<td>6.</td>
<td>Store Port</td>
</tr>
<tr>
<td>7.</td>
<td>Search Machine</td>
</tr>
<tr>
<td>8.</td>
<td>Scrub</td>
</tr>
</tbody>
</table>
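A fixed priority structure like this can be modeled as a first-match scan. The sketch assumes that 1 is the highest priority and ignores any busy or interlock conditions; the function name is illustrative.

```python
S_UNIT_PRIORITY = ["SC", "Store Dir.", "Translator", "Fetch Dir.",
                   "I-UNIT", "Store Port", "Search Machine", "Scrub"]

def grant(requestors):
    """Return the highest-priority requestor among those asking, or None."""
    for name in S_UNIT_PRIORITY:
        if name in requestors:
            return name
    return None
```

For example, `grant({"Scrub", "I-UNIT", "Store Port"})` selects the I-unit, so background work such as Scrub only runs in cycles nobody else wants.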
Some Advanced Stuff

• Line crossers (LX), potential page crossers (PPX)
  - For LX, operand to be fetched spans 2 lines, requiring 2 buffer accesses. First access loads OWR, second access overclocks only those bytes that come from the 2nd line.
  - For PPX, future operands for the instruction may cross a page boundary (e.g. MVC). If the second page gets an exception, this needs to be determined early on in the alg.
  - I-unit sends flags indicating PPX and LX (could even be both).
  - XR Complex contains registers (1 for LX, 1 for PPX) that can be loaded with an incremented version of the BOAR (increment to next line or page, appropriately).
  - For the second flow, the appropriate XR register is selected into the P-cycle instead of the fetch port.

<table>
<thead>
<tr>
<th>LX/PPX Timing</th>
</tr>
</thead>
<tbody>
<tr>
<td>External Flow</td>
</tr>
<tr>
<td>XR loaded</td>
</tr>
<tr>
<td>LX2 or PPX2 flow</td>
</tr>
<tr>
<td>Status Valid</td>
</tr>
</tbody>
</table>

• Out of Order Fetches (OOF)
  - For SS OPs the address for the store comes from the 2nd HW of the instruction, and the address for the fetch comes from the 3rd HW. As a result it's more convenient to do the store flow first, followed by the fetch which will provide the data used by the store.
  - This fetch is called an Out of Order Fetch to the S-unit. The main difference is that the fetch can ignore SFI w.r.t. the store port containing the associated store. This associated store is logically after the fetch, so the fetch can (and must) proceed before the store completes, even if the addresses overlap.
Some Advanced Stuff (continued)

- **Continuing Stores**
  - Used when the I-unit wants to do a series of contiguous stores to the same line (e.g. Store Multiple).
  - Special Store Queue buffer can hold up to 64 bytes (8 DW), all associated with 1 port.
  - This buffer can write 16 bytes per cycle to the cache.
  - Thus, you need to do:
    * 1 store flow to allocate the port.
    * 1 write flow per QW.
  which is a lot faster than 2 flows (1 store, 1 write) *per DW*, as it would be otherwise.
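The flow counts can be checked with a little arithmetic. The sketch assumes the data is an aligned multiple of a quadword; the function names are illustrative.

```python
DW_BYTES = 8
QW_BYTES = 16

def flows_ordinary(total_bytes):
    """One store flow plus one write flow for every doubleword."""
    return 2 * (total_bytes // DW_BYTES)

def flows_continuing(total_bytes):
    """One store flow to allocate the port, then one write flow per quadword."""
    return 1 + total_bytes // QW_BYTES

full_buffer = 64   # bytes held by the special Store Queue buffer
ordinary = flows_ordinary(full_buffer)
continuing = flows_continuing(full_buffer)
```

For a full 64-byte buffer this gives 5 flows instead of 16.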

<table>
<thead>
<tr>
<th>Continuing Store</th>
</tr>
</thead>
<tbody>
<tr>
<td>IU Store</td>
</tr>
<tr>
<td>IU Fetch (src data)</td>
</tr>
<tr>
<td>IU Fetch (src data)</td>
</tr>
<tr>
<td>IU Fetch (src data)</td>
</tr>
<tr>
<td>IU Fetch (src data)</td>
</tr>
<tr>
<td>Data Readies</td>
</tr>
<tr>
<td>SU Store</td>
</tr>
<tr>
<td>Alloc. Store Port</td>
</tr>
<tr>
<td>Write Flow 1</td>
</tr>
<tr>
<td>Write Flow 2</td>
</tr>
<tr>
<td>DATBXW</td>
</tr>
<tr>
<td>DATBXW</td>
</tr>
<tr>
<td>DATBXW</td>
</tr>
<tr>
<td>DATBXW</td>
</tr>
<tr>
<td>DATBXW</td>
</tr>
<tr>
<td>DATBXW</td>
</tr>
<tr>
<td>DATBXW</td>
</tr>
<tr>
<td>DATBXW</td>
</tr>
<tr>
<td>ATBR</td>
</tr>
<tr>
<td>ATBR</td>
</tr>
<tr>
<td>ATBR</td>
</tr>
<tr>
<td>ATBR</td>
</tr>
<tr>
<td>ATBR</td>
</tr>
<tr>
<td>ATBR</td>
</tr>
<tr>
<td>ATBR</td>
</tr>
</tbody>
</table>

- **Store Propagate**
  - Some SS ops (e.g. MVCL) can be used to store the same data value to all bytes of the destination field. This is called *propagation*.
  - When doing such a propagation, the Store Port only needs to be loaded with 1 DW of data. This doubleword can then be simultaneously written to multiple DWs in the buffer (up to 64 bytes) using just 1 write flow.

- **Line Store**
  - A special case of the above propagation is used by the operating system to do page clears - the same byte (typically 00) is propagated to an entire 4K page of data.
  - Often this page is cleared in anticipation of allocating it to a process. Since it isn't yet allocated, no further references will be made to it for a while, so you'd rather do the stores to MS without bringing the page into the cache and displacing more useful data.
  - Accordingly, the I-unit detects this case and generates a Linestore to the S-unit. The S-unit, in turn, passes the Linestore on to the System Controller, which will propagate the byte throughout a line of data directly in Main Store.
CPU Performance Analysis

\[
\text{MIPS} = \frac{1000}{P (\text{ns/cyc}) \times I (\text{cyc/instr})}
\]

\[I = E + D + S + M\]
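As a worked example of the two formulas above, the sketch below plugs in made-up values. None of the numbers describe a real machine; they are chosen only to make the arithmetic easy to follow.

```python
def cycles_per_instruction(e, d, s, m):
    """I = E + D + S + M, all in cycles per instruction."""
    return e + d + s + m

def mips(cycle_ns, cpi):
    """MIPS = 1000 / (P * I), with P in ns per cycle."""
    return 1000.0 / (cycle_ns * cpi)

def storage_delay(miss_rates, miss_penalties):
    """S as a sum of (miss rate x miss penalty) terms."""
    return sum(rate * penalty for rate, penalty in zip(miss_rates, miss_penalties))

# Hypothetical values: 10 ns cycle, E = 1.5, D = 0.5, S = 0.25, M = 0.25
cpi = cycles_per_instruction(1.5, 0.5, 0.25, 0.25)
rate = mips(10.0, cpi)
```

With these values I is 2.5 cycles per instruction and the rate is 40 MIPS. Halving any one component of I helps less than it might seem, because the other three are unchanged.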

**Execution**
- nominal instruction execution time, assuming no interlocks.
- Function of:

**Delay**
- delays due to pipeline interlocks, other than FDI, including:
  - I-fetch: Branch penalties, other IF disruptions due to branches
  - Pipeline interlocks: EGI, OPI, etc.
- In addition to instruction mix and µcode, this is a function of:

**Storage**
- FDI delays - waiting for buffer data.
  \[S = \mu_i \cdot M_i + \mu_o \cdot M_o + \mu_{ib} \cdot M_{ib}\]
  \((\mu = \text{miss rate, } M = \text{miss penalty, } i \text{ means IF, and } o \text{ means OP})\)
- In addition to instruction mix, this is primarily a function of:

**MP Serialization**
- Some instructions require the CPUs to synch up (get between units of operation) before the instruction is executed. Each CPU completes the current unit of operation, then waits until the instruction is executed. Thus, the Initiating CPU pays a penalty waiting for the others, and the Receiving CPUs pay a penalty each time they have to stop and wait.
  \[M = \text{Initiator Rate} \times \text{Initiator Penalty} + \text{Receiver Rate} \times \text{Receiver Penalty}\]
- In addition to instruction mix, M is primarily a function of:
System Storage
SONA Overview

- **System Storage is the focal point for data transfer between:**
  - CPU(s)
  - IOP(s)
  - SVP
  - System Storage itself

- **System Storage includes:**
  - Main Store Array
  - Key Array
  - XSU Controller/Array
  - System Data Switch (provides connectivity between CPUs/IOPs and data)
  - System Controller (address and control focal point)
## SC Opcode List (condensed)

### SU MS Data Ops
- **FETCH** - 4 flavors (Public/Private x fetch/prefetch)
  - DECLARE PRIVATE
- **S-UNIT LMO**
  - LINESTORE - 4 flavors
  - COPY REASSIGN OPCODES (2)
  - RELEASE CACHE LINE

### SU MP & Key Ops
- **PURGE** - 7 flavors
- **SSKNP** Set Storage Key Non Propagate
- **SCRB** Set Reference and Change Bit
- **RRB** Reset Reference Bit
- **ISK** I-Unit Key Fetch
- **TLB KEY REQUEST** (S-Unit Key Fetch)
- **SSK** Set Storage Key Propagate
- **SWK** Swap Storage Key
- **IPTE** - 13 flavors
- **LDMRUMSGprop** PROPAGATE LOAD MRUT

### XSU Ops
- **PGOUT MS ADRS** Page-out Mainstore Addr
- **PGOUT XS ADRS** Page-out Extended Storage Addr
- **PGIN MS ADRS** Page-in Mainstore Addr
- **PGIN XS A**
- ~21 other XSU Ops

### IOP Ops
- **FETCH** - 5 flavors
- **RELEASE LOCK**
- **STORE** - 5 flavors

### Internal MS Ops
- **MAINSTORE WRITE**
- **MAINSTORE SCRUB**
SC Opcodes

• All requests to System Storage come through the SC.

• SC Design is oriented around the Data Ops, esp. Fetch.
  - A Fetch request from the S-unit leads to a _____________.
  - May say Public is OK, or may ask for it Private.

• LMO
  - the SC initiates this by sending LMO pipeflows down the S-unit pipe.
  - These flows return the address and data to System Storage in the form of a LMO "request" to the SC.

• MP and Key Ops
  - MP Ops involve propagation of the operation to other CPUs.
  - Key Ops operate on the Key Array.

• XSU Ops
  - Data transfers between the XSU and the MSU.

• IOP Ops
  - Fetches are similar to S-unit OPs.
  - Lacking a cache, the IOP does Stores directly to the MSU. Similar to a LMO.
  - To do Read-Modify-Write, the IOP can lock a line. The SC maintains this lock.

• MS Write
  - The actual writing of data to the Main Store.
  - Done in the background, thanks to the Move Out Queue.
Basic System Storage Concepts

- The I-bus chooses the highest priority request and loads it into the SC ports
  - A request includes a packet with enough information to process the request.
  - If the I-bus doesn't accept a request, it's up to the requestor to try again.

- The SC Ports are the focal point of System Storage
  - Central mailbox containing everything dealing with a request, including:
    * The initial request.
    * Current status of processing.
    * Any data associated with the request.
  - FIFO Queue: I-bus loads Bottom of Queue Port, servers process Top of Queue.

- Servers provide the control to process the request
  - Send addresses and control to the arrays.
  - Transfer results to the requestor.
  - Process requests independently. Communicate with each other through status bits.
  - Each server proceeds at its own pace. Each server has its own TOQ.

- Arrays provide storage for Data and Keys

- Basic actions needed to complete an SU Fetch:
  - ________________
  - ________________
  - ________________
  - ________________
  - ________________
  - ________________
  - ________________

- Move Out Queue provides buffering for writes to Main Store
  - Holds Move Out data while Main Store is busy doing the read.
  - Maintains data in a queue, does write to MS in background.
  - Analogous to ________________
System Storage Basic Blocks

- **I-bus**
  - Selects request to load into the ports.

- **SC Ports**
  - Provide storage for the initial request.
  - Separate output selectors for each server, allowing servers to go at their own pace.
  - Also provide individually writeable status bits allowing each server to post its status.
  - 8 ports in an SS system, addressed by Port ID.

- **Data Buffers**
  - Conceptually an extension of the SC Ports, but often referred to separately.
  - a.k.a. Port Data Buffers (PDBs).
  - Swap/Store Buffers hold Swap LMO data.
  - Fetch Buffers hold MI data.

- **Key Ports**
  - Conceptually an extension of the SC Ports.
  - Hold data read out of Key Array.

- **Arrays**
  - Main Store, Key, and Move Out Queue.
  - MS and Keys implemented on BLCs, MOQ is in SIMTEC.

- **Servers**
  - Each server has its own Port ID to read out a request from the SC Ports.
  - **Main Store Request Server**: sends address and control to the MSU.
  - **Key Server**: sends address, control, and data to the Key Array.
  - **Data Integrity**: searches all caches for data, initiates DI Move Out if needed.
  - **MO Server**: initiates Swap MO based on replacement info from S-unit.
  - **MI Server**: initiates MI flows to S-unit and controls data transfer out of Fetch Buffers, based on status posted by ports. In general, wraps things up for a request.
  - **MOQ Search Server**: searches MOQ for data.
  - **MOQ Add Server**: transfers MO data from Ports/Buffers into MOQ.
  - **MOQ Transfer Controller**: initiates transfer of data from MOQ to MS, via SC ports. Not a "server" as it doesn't process SC Port requests.

- **Interface Controllers**
  - Provide actual control and address interface to the S-unit pipeline.
  - MI, MO, and DI (path not shown) may all contend for a given IFC.
Port Structure

- **Looks like a multi-ported RAM.**
  - Has data in, write address (Write Port ID).
  - Multiple read paths provided, each with a separate read address (Read Port ID).
  - Read paths customized: only provided for servers that need them.
    - A given server may have several selectors covering different bits. This allows the server to read different bits at different times.
    - e.g. SC Ports
      - Data is the original request.
      - Write Port ID is ____________
      - Read paths for every server.

- **Each port includes multiple pieces which are all variations of this structure.**
  - SC Ports (original request)
  - Swap Address Buffers
    - The PA of the line to be swapped out on a fetch is sent over much later than the initial request and is stored in a special section of the port called a Swap Address Buffer.
  - Various status bits
  - Data buffers (Swap and Fetch data)
  - Key Ports

- **Each server processes like a FIFO queue**
  - Different servers may be on different requests at the same point in time, but each server cycles through the Ports on a FIFO basis.

- **Status Files**
  - Each server sets one or more status bits, including:
    - Done bits (1 per server) indicating server is done with request. Stays set until port is overclocked with a new request.
    - Timing bits: Provide timing information to other servers.
    - Results: Specific results obtained by the server.
  - Separate Write Port ID is provided for each server's set of status bits.
  - Read paths customized for each bit
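The port structure can be pictured as a circular buffer with one write pointer and a separate read pointer per server. The sketch below is a hypothetical model: the class, its methods, and the server names in the usage note are invented, and it does nothing to stop a new request from replacing one that some server has not yet processed.

```python
class ScPorts:
    """One Write Port ID (Bottom of Queue) and one Read Port ID per server."""

    def __init__(self, servers, num_ports=8):
        self.requests = [None] * num_ports
        self.done = [set() for _ in range(num_ports)]   # Done bits, one per server
        self.write_id = 0
        self.read_id = {s: 0 for s in servers}          # each server's own TOQ

    def load(self, request):
        """Load a new request; status from the previous occupant is cleared."""
        port = self.write_id
        self.requests[port] = request
        self.done[port] = set()
        self.write_id = (port + 1) % len(self.requests)
        return port

    def serve(self, server):
        """Let one server finish its TOQ request and step on to the next port."""
        port = self.read_id[server]
        request = self.requests[port]
        if request is None or server in self.done[port]:
            return None                                 # nothing new for this server
        self.done[port].add(server)
        self.read_id[server] = (port + 1) % len(self.requests)
        return request
```

With two servers, one can work through several requests while the other is still on the first; each sees the requests in the same FIFO order.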
I-bus

- **Highest priority request (assuming no busies) accepted into I-bus. Includes:**
  - Opcode
  - Physical Address
  - Logical (Effective) Address (S-unit only)
  - Dimension ID
  - Swap Line State (S-unit only)
  - Miscellaneous stuff

- **Priority tree:**
  1. Long Move Out
  2. MOQ HI
  3. MS Patrol
  4. eXpanded Storage Controller
  5. SVP
  6. IOP - ties broken by toggle latch
  7. S-unit (non-LMO) - ties broken by toggle latches
  8. MOQ LO

- **Busies used to protect resources**
  - MS Element Busy
    * Based on PA0:2
    * Protects MS RAMs from a second access while first is still busy.
  - DIEC Busy
    * Based on PA21:24
    * Prevents multiple requests to same line from being in SC at the same time.
    * Stands for Data Integrity Equivalency Class.
    * Pronounced DEEK.
    * If the winning request has a conflict with a busy, it isn't validated in the SC ports.

- **Most requests go from I-bus into SC ports. Exceptions include:**
  - Swap LMO: goes into Swap Address Buffer for originating port.
  - MS Write requests (from MOQ only) go through Write Buffers.
  - XSC requests go through a dedicated XSC port.

- **Request buffers hold pending CPU requests until they're accepted.**
  - a.k.a. Holding Registers.

- **IOP and SVP addresses may be absolute, requiring MRU Table access.**

- **NOTE: "S-unit" and "CPU" used interchangeably.**
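The selection and busy checks can be sketched as follows. Several things here are assumptions: a 32-bit PA numbered from bit 0 as the most significant bit, every accepted request setting both busies, and ties within a priority level (the toggle latches) being ignored.

```python
PA_WIDTH = 32   # assumed: PA bits numbered 0 (most significant) to 31

def pa_field(pa, first, last):
    """Extract bits first:last of a PA, with bit 0 as the most significant bit."""
    width = last - first + 1
    return (pa >> (PA_WIDTH - 1 - last)) & ((1 << width) - 1)

I_BUS_PRIORITY = ["Long Move Out", "MOQ HI", "MS Patrol", "XSC",
                  "SVP", "IOP", "S-unit", "MOQ LO"]

def select(requests, busy_elements, busy_diecs):
    """Pick the highest-priority (source, pa) request; accept it only if not busy."""
    if not requests:
        return None
    source, pa = min(requests, key=lambda r: I_BUS_PRIORITY.index(r[0]))
    element = pa_field(pa, 0, 2)    # MS Element Busy is based on PA0:2
    diec = pa_field(pa, 21, 24)     # DIEC Busy is based on PA21:24
    if element in busy_elements or diec in busy_diecs:
        return None                 # winner not validated; requestor must retry
    busy_elements.add(element)
    busy_diecs.add(diec)
    return source, pa
```

Note that a rejected winner does not let the next request in; the cycle is lost and the requestor tries again.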
SU/SC Interface - DIEC Busies

Rev. 1, 5/91

Nominal MI Timing

| Fetch Flow | P A T B R |
| Tag Miss   |         |
| Send MI Req| 1-1      |
| Latency Count | 5 4 3 2 1 0 |
| DIEC Busy  |         |
| LdBypTagAddr | P A T B R |
| MI DIEC Kick | 1-1    |
| Fetch Flow  | P A T B R |

SC-SU Interface - DIEC Busies

SC Sends a copy of DIEC Busies to the SU
- these are used to kick fetch ports out of DIEC Busy Wait states.
- also used in the B-cycle to determine whether or not to send a request to the SC.

Fetch Flow gets Line Missing:
- If DIEC Busy is already on (in the B-cycle) no request is sent and the flow goes directly into a DIEC Busy Wait State.
- If DIEC Busy isn't on then the request has a chance to get into the SC:
  1. Send a request to the SC.
  2. Wait a while so the DIEC Busy has time to come on.
     NOTE: implemented by going into DIEC Busy Wait and forcing DIEC Busy with a counter.
  3A. If it doesn't come on then assume the request failed and recycle.
     NOTE: it could succeed out of _________ during the recycle.
  3B. If it does come on, then assume the request succeeded and wait in DIEC Busy.
     NOTE: it could've actually failed and the DIEC Busy is due to a different request.
- Recycle when the DIEC goes available, or when kicked by the LdBypTagAddr flow. This kick is DIEC specific.
MSU Data Paths
Rev. 1, 5/91

MS Addressing

<table>
<thead>
<tr>
<th>Physical Address</th>
<th>0</th>
<th>1:2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>24</th>
</tr>
</thead>
<tbody>
<tr>
<td>1Mx1 SRAMs</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Side Select (DS)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>4Mx1 SRAMs</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Side Select (DS)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>RAM Address</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
MSU Data Out Paths

• 1Mx1 RAMs dotted in pairs to create a 2Mx1 structure
  - PA5:24 addresses the RAMs
  - PA4 selects RAM to enable

• 2M x 128 byte lines per element
  - 128 RAMs (64 pairs) per array card = 2M x 64 bits per card
  - 20 array cards per element = 2M x 1280 bits = 2M x 128 bytes + ECC
  - Can read or write an entire line at a time

• 4 elements per side
  - 4 x 2M x 128 = 1 GB/side

• Data Out MUX (16 to 1) selects source element and muxes quarter lines (32 B) into ECCDIR.

• ECC speed matches to load Fetch Data Buffers
  - MSU runs at 1/2 speed clocks, SDS is on full speed clocks.
  - ECC (on SDS) selects 16 bytes/cycle from 32 byte ECCDI register.
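The sizes above can be cross-checked with a short calculation. The constant names are illustrative, and treating everything beyond the 128 data bytes as ECC follows the "+ ECC" wording above rather than any detailed knowledge of the array cards.

```python
RAM_DEPTH = 1 << 20          # 1Mx1 RAM
PAIR_DEPTH = 2 * RAM_DEPTH   # two RAMs dotted together: 2Mx1
RAMS_PER_CARD = 128
CARDS_PER_ELEMENT = 20
LINE_BYTES = 128
ELEMENTS_PER_SIDE = 4

bits_per_card = RAMS_PER_CARD // 2                     # 64 pairs, 64 bits wide
bits_per_element = bits_per_card * CARDS_PER_ELEMENT   # bits read per line access
ecc_bits = bits_per_element - LINE_BYTES * 8           # what is left after the data
bytes_per_side = ELEMENTS_PER_SIDE * PAIR_DEPTH * LINE_BYTES
line_address_bits = PAIR_DEPTH.bit_length() - 1        # PA4 plus PA5:24
```

The result is 1280 bits per line, of which 256 are beyond the data, 21 address bits per element, and 2^30 bytes (1 GB) per side.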
MS Request Server

• Next active request loaded into MS Request Register.

• Request sent on to MSU
  - PA1:24
  - Port opcode decoded to 1 bit, plus a Valid bit.
    * 1 bit indicates Read or Write.

• State machines informed of the request
  - Busy Control tracks timing of busies for I-bus.
  - Fetch Service Queue:
    * Tracks fetch requests that have been sent out.
    * Initiates Muxing out when data ready.
  - Mux Control controls Data Out Mux and ECC on MSU and SDS.

• Status bits posted, as appropriate, to inform other servers of progress.

Sample MRS Timing
(SU Fetch, 25 ns RAMs)


SU FETCH: 123456789012345678901
PA: 123456789012345678901
AT: 123456789012345678901
BR: 123456789012345678901

Hold Reg: |------------------------|---|
SC Port:  |------------------------|---|
MS Req Reg: |---|---|
EL Address: |1-2-3-4-5-6-7-8-9-0-1-2|
EL Data Out: |-------------|
Data Pending Set: |-------------|
ECC DIR: |QL0|QL1|QL2|QL3|
ECC DOR: |Q0|Q1|Q2|Q3|Q4|Q5|Q6|Q7|
Data Bfrs: |Q0|Q1|Q2|Q3|Q4|Q5|Q6|Q7|

(Q = 1st QW w/o ECC)
MO Server

- Based on Replacement Line state, initiates Swap MO.

- 5 state machines pipelined together:
  - Read State Machine:
    * Analyzes request to determine if a Swap MO is needed.
    * If so, requests IFC priority.
    * When given IFC grant, passes control to Delay1 state machine.
  - Delay1-2 State Machines:
    * Each machine counts 3 cycles, then passes the request on to the next stage.
  - Write State Machine
    * Monitors MO status for line locked or other problems.
    * If LMO, initiates Data control state machine.
    * Posts status.
  - SDS Data State Machine
    * controls transfer of data into Swap/Store Buffers.

<table>
<thead>
<tr>
<th>MO Timing Example</th>
</tr>
</thead>
<tbody>
<tr>
<td>SU Fetch</td>
</tr>
<tr>
<td>Hold Reg.</td>
</tr>
<tr>
<td>SC Port</td>
</tr>
<tr>
<td>Repl. info in SC</td>
</tr>
<tr>
<td>MO Stages</td>
</tr>
<tr>
<td>Req. to IFC (IFC)</td>
</tr>
<tr>
<td>Status posted</td>
</tr>
</tbody>
</table>

QP Cache Search Possibilities

- In a QP system, the requested data could be in

\[
4 \times 8 \times n \times 2 = 64n
\]

- Each value of EA18:19 is referred to as a SLOT in the DI Server
  - at most there are ____ matches per slot.
  - ____ total matches possible.
DI Server

- Responsible for:
  - Finding any cache copies of requested line.
  - Initiating Move Outs, as appropriate, to ensure all caches follow the DI rules.
  - Bypassing data into Fetch Data Buffers for DI LMOs.

- Focal point is TAG2
  - Copy of all S-unit TAGs of a QP (or of 1 side of a DS system).
  - Each entry includes a valid bit, pub/priv bit, and PA:19.
  - Organization allows 1 slot (i.e. EA18:19 value) to be accessed at a time.
  - Accessed via pipeline.
    * Priority
    * Match
    * Results
  - MI server can access to update during a Move In.

- TOQ request loaded into available window
  - Window holds request info for duration of DI processing.
  - 2 windows.
  - Saves long path of going out to SC ports and back to read info.

- Search stage initiates 4 flows to search all EA18:19 values

```plaintext
Slot 0    P M R
Slot 1    P M R
Slot 2    P M R
Slot 3    P M R
```

- Inspect stage analyzes match results
  - Results stored in DI Match registers.
  - 8 registers per slot (since only 1 associativity per cache set can match).
  - Inspect stage MUXes out 1 slot of the DIMRs at a time for analysis.
  - If MOs are needed, the MO stage is informed.

- Move Out stage
  - Requests priority to IFC, which in turn will control interface to S-unit.
  - Monitors status from S-unit to see if line is locked (or modified).
  - Controls data transfer into FDBs, posts status, and requests TAG2 access to update.
  - Addresses (for LMO and TAG2 update) come from window.
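
The TAG2 search can be pictured with a small software model. This is a hedged sketch, not the hardware design: `Tag2Entry` and `search_tag2` are hypothetical names, TAG2 is modeled as one list of entries per slot, and the DI rules that decide which matches need a Move Out are not modeled.

```python
from dataclasses import dataclass

NUM_SLOTS = 4  # one slot per EA18:19 value


@dataclass
class Tag2Entry:
    valid: bool
    private: bool   # pub/priv bit
    pa: int         # physical address bits held in the TAG


def search_tag2(tag2, pa):
    """Run one search flow per slot and return the matches found in each.

    tag2[slot] is the list of entries for that EA18:19 value. Only valid
    entries whose PA equals the requested PA count as matches.
    """
    results = []
    for slot in range(NUM_SLOTS):  # four P-M-R flows, one slot at a time
        results.append([e for e in tag2[slot] if e.valid and e.pa == pa])
    return results
```

The per-slot result lists stand in for the DI Match registers that the Inspect stage examines one slot at a time.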
CPU Interface Controller

Rev. 1, 8/91

[Block diagram. Labels: MI, MO, DI Servers; Requests; Grants; Priority Determination; MI Req. Pkg.; MO Req. Pkg.; DI Req. Pkg.; MIFM; MOFM; Remap; Request to S-unit]

Request Package

<table>
<thead>
<tr>
<th>SU Opcode</th>
<th>PA</th>
<th>EA18:19</th>
<th>Assoc. #</th>
<th>Flags, Misc.</th>
</tr>
</thead>
</table>

[Pipeline stages shown: I, G, P-1]
Interface Controllers

- Responsible for controlling the interface into S-unit pipeline.
  - Selects between DI, MO, and MI servers for priority.
  - Generates subsequent flows of multiple flow algs.
  - 1 IFC per S-unit.

- Pipelined.
  - I: IFC priority
  - G: Grant
  - P-1: Send request to SU (in P-1 cycle of S-unit pipe)

- DI, MO, and MI servers contend for priority into IFC in the I-cycle.
  - Highest priority request gets its request package sent down the pipe.
  - If this request is for a multiple flow algorithm:
    * Flow Machines (one each for MI and MO) are fired up to generate the follow on flows.
    * These follow on flows are highest priority.
    * The opcode and low order address bits are modified to form the follow on flows, and saved until needed.

- Request package selected and sent to S-unit.
  - Opcode
  - PA (exact bits depend on operation)
  - EA18:19
  - Associativity #
  - Flags, Misc.
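
The priority rule described above (follow-on flows first, then the servers) can be sketched as follows. The class, the dictionary-based request package, and in particular the fixed `SERVER_ORDER` are assumptions made for illustration; the source does not give the relative priority of the DI, MO, and MI servers.

```python
from collections import deque

# Assumed order among servers; the real priority scheme is not stated.
SERVER_ORDER = ("DI", "MO", "MI")


class InterfaceController:
    def __init__(self):
        # Follow-on flows generated by the MI/MO Flow Machines.
        self.follow_on = deque()

    def select(self, requests):
        """Pick the request package to send down the pipe this I-cycle.

        requests maps a server name to its request package (a dict), or
        omits the server if it is idle. Follow-on flows always win.
        """
        if self.follow_on:
            return self.follow_on.popleft()
        for server in SERVER_ORDER:
            pkg = requests.get(server)
            if pkg is not None:
                # A multiple-flow algorithm queues its later flows now.
                self.follow_on.extend(pkg.get("follow_on", ()))
                return pkg
        return None
```

With a Move In whose package carries two follow-on flows, a DI request arriving in the next cycle waits until both follow-on flows have been sent.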

<table>
<thead>
<tr>
<th>MI Server Req</th>
<th>---</th>
</tr>
</thead>
<tbody>
<tr>
<td>Ld BypTag Flow</td>
<td>I G P-1 P A T B R</td>
</tr>
<tr>
<td>Start MIFM</td>
<td>---</td>
</tr>
<tr>
<td>MIFM Busy</td>
<td>-------------------------------</td>
</tr>
<tr>
<td>MIFM Req.</td>
<td>---</td>
</tr>
<tr>
<td>LMI1 Flow</td>
<td>I G P-1 P A T B R</td>
</tr>
<tr>
<td>MIFM Req.</td>
<td>---</td>
</tr>
<tr>
<td>LMI2 Flow</td>
<td>I G P-1 P A T B R</td>
</tr>
</tbody>
</table>

Sample IFC Timing - Move In

AMDAHL INTERNAL USE ONLY
[Diagram legend: Address/Data Flow; Control Flow]
Long Move Out Process Flow

1. Initiated by MO (Swap) or DI (DI LMO) Servers.

2. Interface controller sends ___ flows down S-unit pipe.

3. S-unit sends out data from cache and PA from TAGs
   - also sends Valid bit (line may be a Ghost).
   - also sends modified bit (on DI MOs, TAG2 only knows pub/priv. If line is unmodified, don't write MOQ).

4. The MO is loaded into the Ports.
   - The PA is sent over the I-bus into the Swap Address Buffer.
     * PA only needed on Swap LMO. On DI LMO it's ____________________________.
   - The data is loaded into a Data Buffer.
     * __________________ for DI LMO.
     * Swap/Store Data Buffer for Swap LMO.

5. MOQ Add server then puts the Move Out into the MOQ.
   - The DI or MO server posts status bits telling the Add Server the MO is there.
   - The Add Server loads the PA into a MOQ TAG.
   - The Add Server loads data into the MOQ array.
   - This is the end of foreground processing (i.e. original SC Port request is now done).

6. MOQ Transfer Controller cycles through MOQ emptying out pending requests.
   - Loads data into MS Data In Register.
   - Reads PA out of TAG and sends to I-bus as a MS Write Request.

7. I-bus loads this request into a Write Buffer (instead of using an SC Port).

8. The MS Server generates a write request to the MSU.
   - It sends the PA, plus an opcode saying to do a write.
   - The data is already set up in the MSDIR.
MOQ Organization

- 32 deep FIFO queue, 1 line per slot.
  - MOQ can hold 4K of data.

- Data Organization
  - Can Read or Write 32 bytes (1 QL) per cycle.
  - 4 QLs per slot.
  - Address includes 5 bit Slot Number and 2 bit QL number.
  - Implemented in RAMs.

- TAG Organization
  - 1 TAG per slot.
  - Includes:
    * PA0:24
    * Valid bit
    * Misc. bits
  - Implemented in latches.

- Data and TAG Pipelines are accessed independently.

- Non-pipeline control provided by:
  MOQ Add Server
    - Writes data/address into MOQ.
  MOQ Search Server
    - For Fetch requests, searches MOQ to see if data is there.
  Transfer Controller
    - Transfers data out of MOQ and sends Write request to MS.
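
The MOQ sizes and the 7-bit data address fit together as shown in this sketch. The function name is illustrative, and placing the slot number in the high-order bits is an assumption; the slide only says the address includes both fields.

```python
SLOTS = 32          # 32-deep queue, 1 line per slot
QLS_PER_SLOT = 4    # 4 quarter lines per line
QL_BYTES = 32       # 32 bytes read or written per cycle

moq_capacity_bytes = SLOTS * QLS_PER_SLOT * QL_BYTES   # 4K of data


def moq_data_address(slot, ql):
    """Combine a 5-bit slot number and a 2-bit QL number into one address.

    Assumption: the slot number occupies the high-order bits.
    """
    if not 0 <= slot < SLOTS:
        raise ValueError("slot out of range")
    if not 0 <= ql < QLS_PER_SLOT:
        raise ValueError("QL number out of range")
    return (slot << 2) | ql
```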
MOQ Data Pipeline

- **Priority determination and address selection**
  - Add Server contends to 
  - Transfer Controller contends to 
  - Search Server contends to 

- **Transfer address** (i.e. Slot ID, QL#) to SDS.
  - For Transfer or Search, nothing else happens during the T cycle.
  - For Adds, the Add Server reads data out of the Data Buffers to prepare to do a write.
    * This control can be done directly by the Add Server as no other Data Pipe contenders use these paths. Thus, these latches aren't strictly associated with a cycle point.
    * Note that 16 bytes are read out per cycle and concatenated to form 32 byte QLs. Because of this, the Add Server will only try to do a Write flow every other cycle.

- **Distribute address** within SDS.

- **Access MOQ STRAMs.**
  - This cycle point is the address/data in latches in the STRAM macro.

- **Results available.**
  - The 32 bytes can be MUXed over to the MSU 16 bytes at a time, but each 16 bytes takes 2 cycles to transfer, so the Transfer Controller does reads every 4 cycles.
  - Note the MOQ Bypass path back to the Fetch Data Buffers.
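
The "every other cycle" and "every 4 cycles" figures follow directly from the path widths. The sketch below just redoes that arithmetic; the constant names are illustrative, and the per-line drain time is a derived lower bound that assumes back-to-back read flows.

```python
QL_BYTES = 32                     # one MOQ read or write
MSU_TRANSFER_BYTES = 16           # muxed to the MSU 16 bytes at a time
CYCLES_PER_MSU_TRANSFER = 2       # each 16-byte transfer takes 2 cycles
DATA_BUFFER_BYTES_PER_CYCLE = 16  # Add Server reads 16 bytes per cycle
QLS_PER_LINE = 4

# Transfer Controller: how often it can start a new read flow.
transfers_per_ql = QL_BYTES // MSU_TRANSFER_BYTES
read_interval_cycles = transfers_per_ql * CYCLES_PER_MSU_TRANSFER

# Add Server: cycles needed to assemble one 32-byte QL before a write flow.
write_interval_cycles = QL_BYTES // DATA_BUFFER_BYTES_PER_CYCLE

# Derived lower bound for moving one full line to the MSU.
min_cycles_per_line = QLS_PER_LINE * read_interval_cycles
```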
MOQ TAGs

- TAGs implemented in latches, allowing some special capabilities.
  - Can match against all 32 TAGs in parallel.
  - Can do concurrent Reads and Writes.
  - Transfer Controller has its own selector to read TAGs.
  - Valid bit Resets dedicated to Transfer Controller (not shown).

- Transfer Controller
  - Loads data into MSDIR, then sends the MS Write request to the I-bus.
  - Once the request is accepted into the I-bus, it resets the Valid bit.

- Pipelined TAG Access for Add and Search Servers
  - PMR pipe
  - Transfer Controller owns all the resources it needs, so it doesn’t need pipe access.

Priority cycle
  - Search and Add Servers contend for pipeline access.

Match cycle
  Search PA register
  - Matched against all 32 TAGs in one cycle.
  - Used by Search Server to
  - Also used by Add Server to
  - Search results encoded and latched.

Write PA register
  - Contains PA to be written into next MOQ slot.
  - Owned by Add Server, so it’s not strictly part of the pipe.

Result cycle
  - Match Results available for examination.
Swap LMO MOQ Add Timing

<table>
<thead>
<tr>
<th>LMO1</th>
<th>PATBR</th>
<th>PATBR</th>
</tr>
</thead>
<tbody>
<tr>
<td>LMO2</td>
<td></td>
<td></td>
</tr>
<tr>
<td>QWs in SDB</td>
<td>0 1 2 3 4 5 6 7</td>
<td></td>
</tr>
<tr>
<td>Write QL0</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Write QL1</td>
<td></td>
<td>PATAR</td>
</tr>
<tr>
<td>Write QL2</td>
<td>PATAR</td>
<td></td>
</tr>
<tr>
<td>Write QL3</td>
<td>PATAR</td>
<td></td>
</tr>
<tr>
<td>PA SwpAdrBfr</td>
<td></td>
<td></td>
</tr>
<tr>
<td>MOQ TAG Update</td>
<td>PATAR</td>
<td></td>
</tr>
</tbody>
</table>

Search Timing (match case)

<table>
<thead>
<tr>
<th>Fetch in SC</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>Search Flow</td>
<td>PATAR</td>
</tr>
<tr>
<td>Match found</td>
<td>PATAR</td>
</tr>
<tr>
<td>Read QL0</td>
<td>PATAR</td>
</tr>
<tr>
<td>Read QL1</td>
<td>PATAR</td>
</tr>
<tr>
<td>Read QL2</td>
<td>PATAR</td>
</tr>
<tr>
<td>Read QL3</td>
<td>PATAR</td>
</tr>
<tr>
<td>*QWs in FDBs</td>
<td>0 1 2 3 4 5 6 7</td>
</tr>
</tbody>
</table>

Transfer Controller Flows

<table>
<thead>
<tr>
<th>Read QL0</th>
<th>PATAR</th>
</tr>
</thead>
<tbody>
<tr>
<td>Read QL1</td>
<td>PATAR</td>
</tr>
<tr>
<td>Read QL2</td>
<td>PATAR</td>
</tr>
<tr>
<td>Read QL3</td>
<td>PATAR</td>
</tr>
<tr>
<td>*QWs in MSDI</td>
<td>0 1 2 3 4 5 6 7</td>
</tr>
<tr>
<td>I-bus Req</td>
<td></td>
</tr>
</tbody>
</table>

* Latch points may be missing from Block Diagrams. Timing diagrams should be correct.
MOQ Algs

Add Server
- The MO Server posts a status bit indicating the Swap LMO1 and LMO2 flows have (or soon will have) loaded the Swap/Store Data Buffers.
- Kicked by this, the MOQ Add Server will initiate 4 MOQ Write flows to write the data into the MOQ.
- In parallel with the write flows, the Add Server writes the PA (from the Swap Address Buffer) into the corresponding MOQ TAG.
- The Add Server also checks the TAGs for an older copy of the line to invalidate.

Search Server
- Fetch requests need to check the MOQ to see if the data's there.
- One TAG search flow examines all 32 TAGs.
- In the case shown it happens to get a match.
- The Search Server requests priority for 4 data read flows and loads the data into the Fetch Data Buffers, for subsequent delivery to the requester by the MI server.

Note - the timing diagram takes into account latch points that aren't shown in the block diagrams.

Transfer Controller
- Having found a valid MOQ entry in the background, the Transfer Controller initiates 4 read flows to read out the entry.
- The data is muxed 16 bytes at a time to the MSU, and each 16 byte transfer takes 2 cycles, so the TC does read flows every 4 cycles.
- In the MSU, the data is loaded into 1 of 2 MS Data In Registers.
- Once all the data is read out, the TC requests I-bus priority to send a MS Write request to the MS Server.
  * Note - this request is sent before all the data is actually in the MSDIR. The timing is such that, by the time the MS Server does the actual write, all the data will be there.
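
The three algorithms can be tied together in a small behavioral model. This is a sketch under stated assumptions, not the hardware: the `Moq` class and its method names are hypothetical, whole lines are moved in one step rather than as four QL flows, an invalid TAG is modeled as `None`, and a full queue is not handled.

```python
class Moq:
    def __init__(self, depth=32):
        self.tags = [None] * depth   # PA per slot; None means not valid
        self.data = [None] * depth   # one line per slot
        self.write_slot = 0          # next slot the Add Server fills
        self.read_slot = 0           # where the Transfer Controller looks next

    def add(self, pa, line):
        """Add Server: invalidate any older copy, then write TAG and data."""
        for slot, tag in enumerate(self.tags):
            if tag == pa:
                self.tags[slot] = None
        slot = self.write_slot
        self.tags[slot] = pa
        self.data[slot] = line
        self.write_slot = (slot + 1) % len(self.tags)
        return slot

    def search(self, pa):
        """Search Server: match the PA against every TAG, return the line."""
        for slot, tag in enumerate(self.tags):
            if tag == pa:
                return self.data[slot]
        return None

    def transfer_one(self):
        """Transfer Controller: empty the next valid entry toward Main Store."""
        depth = len(self.tags)
        for offset in range(depth):
            slot = (self.read_slot + offset) % depth
            if self.tags[slot] is not None:
                entry = (self.tags[slot], self.data[slot])
                self.tags[slot] = None           # reset the Valid bit
                self.read_slot = (slot + 1) % depth
                return entry
        return None
```

Invalidating the older copy in `add` is what lets a later search return only the newest data for a line.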
Key Server

Request Generation Stage
- Recodes SC opcode to an internal key opcode. Includes (among others):
  * SRB   Set Reference Bit.
  * SRCB  Set Reference and Change Bits.
  * STORE Write entire key.
  * FETCH Read and return key.
  * PchkF Do Fetch Protection Check (using provided key). If OK, Set Ref. Bit.
  * RRB   Read and return key, reset the Reference Bit.
- Fires up Key Array Controller with new opcode (except on NOP).

Key Array Controller
- Controls chip selects and write enables
- For opcodes doing both a read and a write, write enable is delayed until read is done.

<table>
<thead>
<tr>
<th>PchkF Timing</th>
</tr>
</thead>
<tbody>
<tr>
<td>-1-2-3-4-5-6-7-8-9-0-1-2-3-4-</td>
</tr>
<tr>
<td>Key RAM Address</td>
</tr>
<tr>
<td>Key RAM Data</td>
</tr>
<tr>
<td>Data Out Reg</td>
</tr>
<tr>
<td>Protection Chk</td>
</tr>
<tr>
<td>Key RAM Write En</td>
</tr>
</tbody>
</table>

- Passes results on to Response Stage State Machine
- Array is _________ deep.
- Key Array runs on half-cycle clocks.

Response Stage
- On Key Reads, key is loaded into port.
- Sets status bits describing request results.
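
The internal key opcodes listed above can be illustrated with a short model. This is a sketch only: the `StorageKey` layout (access-control field, fetch-protect, reference, and change bits) follows the 370 storage key, the fetch-protection rule ignores any override controls, and the function names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass
class StorageKey:
    acc: int             # 4-bit access-control field
    fetch_protect: bool  # F bit
    ref: bool            # Reference bit
    chg: bool            # Change bit


def fetch_allowed(key, access_key):
    """Fetch is OK if the access key is 0, matches, or F is off."""
    return access_key == 0 or access_key == key.acc or not key.fetch_protect


def key_op(key, opcode, access_key=None, new_key=None):
    """Apply one internal key opcode to `key` in place; return its result."""
    if opcode == "SRB":
        key.ref = True
    elif opcode == "SRCB":
        key.ref = key.chg = True
    elif opcode == "STORE":
        key.acc, key.fetch_protect = new_key.acc, new_key.fetch_protect
        key.ref, key.chg = new_key.ref, new_key.chg
    elif opcode == "FETCH":
        return replace(key)          # copy of the key as read
    elif opcode == "PchkF":
        ok = fetch_allowed(key, access_key)
        if ok:
            key.ref = True           # write happens only after the read/check
        return ok
    elif opcode == "RRB":
        old = replace(key)           # read first
        key.ref = False              # then the delayed write
        return old
    else:
        raise ValueError(opcode)
    return None
```

PchkF and RRB are the read-then-write cases for which the Key Array Controller delays the write enable.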
MI Server

[Block diagram. Labels: SC Ports; Data Path Availability; from IFCs; to DI; to IFC, Data Sequencers; Status File; DIEC Busy Reset; to I-bus; Key Port Control; to Key Chip; Handshake; from Key Chip; Handshakes; from DI, IFC; to Status File]
MI Server

- **Pulls it all together and returns results to requestor.**
  - Because the MI server is the one that "wraps it all up", it's the prime reader of status bits.

- **D-cycle**
  - Waits for appropriate status bits to be set.
  - Based on status bits, initiates response by kicking off I-cycle.
    * For S-unit requests, maps the SC opcode to an S-unit opcode.

- **I-cycle**
  - Sends requests to various resources:
    * IFC to make S-unit requests.
    * Based on IFC grant, kicks off a Data Sequencer on the SDS to control data muxing.
    * DI to update TAG2.
    * Key chip to read key results out of the Key Ports.
    * I-bus to reset the DIEC busy.

- **R-cycle**
  - Receives handshaking from above resources to make sure there wasn't an error.
  - If there was, MI stops processing requests until S-code can fix things up.

- **W-cycle**
  - Sets MI Done in the Status File.
## Sample SU Fetch Request Timing

**semi-accurate**

[Timing diagram. Signals traced, by unit:]

- S-unit
- Line Miss

### I-bus

- Holding Reg
- SC Port

### MS Server

- Req. Reg.
- ELAR
- Fetched Data Bfr
- Status Posted

### DI Server

- Window 0
- TAG2 Search
- No match
- Status posted

### MO Server

- Repl. info in SC
- MO Stages
- Req. to IFC
- Status posted

### IFC

- LMO1 Flow
- LMO2 Flow
- LdBypTAG
- MI1
- MI2

### MOQ

- Search
- No Match
- Srch Status
- Swap ADDR Buffr
- Swap Data Buffr
- Wr QL Flows
- MOQ TAG Update

### Key Server

- SRB Flow
- Status Posted

### MI Server

- MI Pipeline
- Req. to IFC

---


AM 3493
Expanded Storage Architecture

[Diagram: Expanded Storage Addressing. Labels: Main Store; Expanded Store; Page Out; Page In; Page Address; Expanded Storage Block Number; Main Store Physical Address]
Expanded Storage Architecture

• Large, slow, dense storage
  - In SONA, 4 GB/side (4 Mb DRAMs). Later, 16 GB/side (16 Mb DRAMs)
  - Larger, slower than MS. Smaller, faster than disk.

• Fundamental unit is a page (4K)
  - In Sequoia a 32 bit Expanded Storage Block Number points to a page.
  - Allows up to 16 TB of data to be stored.
  - Since SONA maxes out at ____ GB, bits ____ are always zero.

• Page Ops transfer a page between Expanded Store and Main Store
  
  **Page In:** Copies a page of data from XSU to MSU.
  **Page Out:** Copies a page of data from MSU to XSU.

  - Instructions include a Main Store address and an ESBN.
  - Operation is synchronous; the CPU waits until the transfer is complete.

• Naming confusion
  - IBM calls it Expanded Store.
  - Many people call it Extended Store.
  - Both ESU and XSU are used as acronyms.
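
The 16 TB limit, and the number of ESBN bits a given amount of Expanded Store actually needs, can be worked out as below. This is a sketch with illustrative names; the capacities passed to `esbn_bits_needed` in the usage note are the per-side figures quoted above, not a statement of the SONA system maximum.

```python
PAGE_BYTES = 4 * 1024   # fundamental unit is a 4K page
ESBN_BITS = 32          # Expanded Storage Block Number width

# Every ESBN value names one page.
max_addressable_bytes = (1 << ESBN_BITS) * PAGE_BYTES


def esbn_bits_needed(capacity_bytes):
    """Number of low-order ESBN bits needed to name every page."""
    pages = capacity_bytes // PAGE_BYTES
    return (pages - 1).bit_length()
```

For example, `esbn_bits_needed(4 * 2**30)` gives 20 for 4 GB and `esbn_bits_needed(16 * 2**30)` gives 22 for 16 GB, so in either case the remaining high-order ESBN bits carry no information.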
XS Address and Data Paths

• XSC Port
  - I-bus sends XS requests to dedicated XS Controller Port.
  - A Page Op requires 2 I-bus commands, one for each address:
    * The ESBN goes through the ERU Table, then into the XAB.
    * The Main Store PA bypasses the ERUT and goes into the IAB.

• Address Buffers
  - Two sets of buffers, one for I-bus Addresses (IAB) and one for XSU Addresses (XAB).
  - IAB/XAB each provide 1 dedicated buffer for each possible requestor (CPU, SVP, IOP).
    * Each Requestor will only have 1 Page Op pending at a time.
  - One operation is handled at a time. Any others wait in the ABs until processed.
  - ABs are not a queue. The XS Controller processes them round robin.

• Data Buffers
  - Buffering provided for 4 lines of data.
  - Data paths are 1 DW (8 bytes) wide.
    * XSU Array path takes 2 cycles to transfer 1 DW.

• Algorithms
  Page In
  1.
  2.
  3.
  4.
  5.
  6.

  Page Out
  1.
  2.
  3.
  4.
  5.

• Other Algs
  - SVP can do line fetches/stores.
  - IOP Page Ops provided in anticipation of asynchronous page in/out.
  - MSU-MSU copies also implemented to support dynamic reconfiguration.
  - Background refresh done for DRAMs.
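
The "not a queue, processed round robin" behavior of the address buffers can be sketched as follows. The class and method names are hypothetical, the three requestor names simply mirror the list above (a real system would have one buffer per actual requestor), and the IAB and XAB contents are modeled together as one `(esbn, ms_pa)` pair.

```python
REQUESTORS = ("CPU", "SVP", "IOP")   # illustrative set of requestors


class XsAddressBuffers:
    def __init__(self):
        # One dedicated buffer per requestor; None means empty.
        self.pending = {r: None for r in REQUESTORS}
        self.next_index = 0   # where the round-robin scan resumes

    def post(self, requestor, esbn, ms_pa):
        """Load a Page Op; each requestor may have only one pending."""
        if self.pending[requestor] is not None:
            raise ValueError("requestor already has a Page Op pending")
        self.pending[requestor] = (esbn, ms_pa)

    def next_op(self):
        """Pick the next pending Page Op in round-robin order, if any."""
        count = len(REQUESTORS)
        for offset in range(count):
            i = (self.next_index + offset) % count
            requestor = REQUESTORS[i]
            if self.pending[requestor] is not None:
                op = self.pending[requestor]
                self.pending[requestor] = None
                self.next_index = (i + 1) % count
                return requestor, op
        return None
```

Because the scan resumes after the requestor last served, a requestor that immediately posts again does not starve the others.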
System Storage DS Concept

• **System Storage is DS Focal Point**
  - All cross-coupling done here.
  - CPUs, IOPs only talk to their local SC/SDS.
  - Data can be on either side of a DS system.

• **Approach is to cross-couple at key points within System Storage**
  - I-bus
  - Status File
  - Main Store
  - Some other cross-coupling isn’t shown.

• **I-bus**
  - Same request wins on both sides.
  - DIEC and other busies must be kept in synch.

• **Status files**
  - Status bits needed by the other side (e.g. DI search results) are cross-coupled.
  - Done by the status file itself.
  - Means duplicating these status bits to provide storage for status from local and remote.

• **MSU**
  - Dual ported, either SC can access both MSUs.
    * MSU inputs have a selector to pick which side drives them.
  - Two data out paths, one for each SDS.
I-bus DS Design

• Four cycles added in DS mode to allow for cross-coupling
  - First two latch points "early-up" the request to send to the other side.
  - Next two latch points "normalize" the late request from the other side.
  - Winning request(s) loaded simultaneously on either side.

• Remote requests loaded into Cross Couple Ports
  - Local requests go into ReQuest Ports.
  - Allows some servers to not bother even seeing remote requests.
    * They just look at the RQ ports.
    * Includes ____________
  - Other servers need to alternate between the two sets of ports.
    * Includes ____________.

• Element and DIEC busies are same on each side
  - Both I-buses set the busies at the same time when loading the request.
  - The I-bus cross-couples the resets so they go off at the same time.
Data Cross Coupling in DS

- Move-outs go to MOQ on same side as the requesting CPU.
  - Bypass to Remote provided for DI MOs.
  - Needed if ________________________________
  - Same bypass path used for MOQ bypass to remote.

- MSU data-in is dual ported.
  - MOQ loads the Data In Register of the appropriate MSU.
  - Then sends MS Write request through I-bus, into its local Write Buffer.
  - The local MRS will send the address and write enable to the target MSU.

- Two data out paths provided.
  - One for each SDS.

- Keys (both address and data) cross-coupled just like MSU.
Things to Remember from the Class

1.

2.

3.

4.

5.